Preface



Concepts like ubiquitous computing and ambient intelligence that exploit in- 
creasingly interconnected networks and mobility put new requirements on data 
management. An important element in the connected world is that data will 
be accessible anytime anywhere. This also has its downside in that it becomes 
easier to get unauthorized data access. Furthermore, it will become easier to 
collect, store, and search personal information and endanger people’s privacy. 
As a result, the security and privacy of data become more and more of an issue.
Therefore, secure data management, which is also privacy-enhanced, turns out 
to be a challenging goal that will also seriously influence the acceptance of ubiq- 
uitous computing and ambient intelligence concepts by society. 

With the above in mind, we organized the SDM 2004 workshop to initiate and 
promote secure data management as one of the important interdisciplinary re- 
search fields that brings together people from the security research community 
and the data management research community. The call for papers attracted 
28 submissions both from universities and industry. The program committee 
selected 15 research papers for presentation at the workshop. The technical con- 
tributions presented at the SDM workshop are collected in this volume, which, 
we hope, will serve as a valuable research and reference book in your professional 
life. 

The volume is divided into four topical parts. The first section focuses on ac- 
cessing encrypted data. The first three papers of this section concentrate on the 
interesting problem of searching in encrypted data, while the last paper discusses 
the integrity of data that is shared or exchanged on the World-Wide Web. The 
second section addresses private data management, as well as management of 
private (personal) data. Research topics of this section include management of 
personal data with P3P for Internet services, privacy in digital rights manage- 
ment, as well as privacy-preserving data mining. The third section focuses on 
access control, which remains an important area of interest for database security 
researchers. Finally, two papers in the fourth section discuss specific topics within 
database security: release control of sensitive associations stored in databases, 
and a method to defend against copying a database as a whole. 
Secure and Privacy Preserving Outsourcing
of Tree Structured Data

Ping Lin and K. Selçuk Candan
Abstract. With the increasing use of web services, many new challenges
concerning data security are becoming critical. Data or applications can 
now be outsourced to powerful remote servers, which are able to pro- 
vide services on behalf of the owners. Unfortunately, such hosts may not 
always be trustworthy. In [1,2], we presented a one-server computation- 
ally private tree traversal technique, which allows clients to outsource 
tree-structured data. In this paper, we extend this protocol to prevent a
polynomial-time server with large memory from using correlations in client
queries and in data structures to learn private information about queries
and data. We show that, when the proposed techniques are used, com- 
putational privacy is achieved even for non-uniformly distributed node 
accesses that are common in real databases. 

Keywords: Search on Encrypted Data, Tree structured data (XML) 
Security, Private Information Retrieval 



1 Introduction 

In web and mobile computing, clients usually do not have sufficient computation
power or memory, so they need remote servers to perform computation or store
data for them. Publishing data on remote servers helps improve service
availability and system scalability, and reduces the clients’ burden of managing
data. Because of their computation power and large memory, such remote servers
are called data stores or oracles. As entities distinct from the data owners,
these data stores cannot be fully trusted: they may be malicious and may be
driven by their own interests to make illegal use of the information stored on
them. Hence data outsourcing introduces security concerns that differ from those
of traditional database services, which assume that the database is honest and
that the data must be protected only from illegal user access. These concerns
have to be addressed effectively to convince customers that outsourcing their IT
needs is a viable alternative to deploying complex infrastructures locally.

1.1 Problem Statement 

Special security concerns with respect to data outsourcing can be categorized
into content privacy and access privacy [12]. Clients that outsource sensitive
data (e.g., personally identifiable data) to an untrusted host may require that
their data be protected from such data storage oracles. This is defined as
content privacy [12] and leads to encrypted database research [7], in which
sensitive data is encrypted so that the content is hidden from the database.
Sometimes not only the data outsourced to a data store but also the queries are
of value, and a malicious data store can use such information for its own
benefit. This is defined as access privacy [12]. Access privacy leads to private
information retrieval [13] research, which studies how to let users retrieve
information from a database without leaking (even to the server) the identity
and the location of the retrieved data item.

Access privacy and content privacy are not independent. If the data has some
structure, plain access may reveal this structure and hence impair content
privacy. For example, if the data is in the form of an XML tree (as described in
the next subsection) and access privacy is not properly protected, the path
followed to find the target data is revealed, and so is the whole structure of
the data tree, which impairs content privacy. Hence, to protect both content
privacy and access privacy, we need to hide the structure of the data.

In this paper, we address secure outsourcing of tree structured data, such 
as XML documents. To be specific, we address hiding of tree-structured data 
and queries on this data. XML documents [8] have tree-like structures. XML 
has become a de facto standard for data exchange and representation over the
Internet [9]. Some work has been done on selective and authentic untrusted 
third-party distribution of XML documents [3, 9-11]. The work focuses on access 
control and authentication of document (i.e., query result) source and content. 
With more and more data stored in XML documents, techniques to hide tree 
structures (the content and structure of XML documents) from untrusted data 
stores are in great need. In an XML database, a query is often given in the 
form of tree paths, like XQuery. To hide XML queries and the structure of XML 
documents, clients need to traverse XML trees in a hidden way. Other frequently 
used tree-structures include indexes that are often built for convenient access to 
data. However, most index structures closely reflect the distribution of the data. 
Thus, in order to hide the data and data distribution from the database, tree 
structure hiding techniques must be adopted to protect index trees from oracles. 
Though the techniques we present in this paper can also be used for hiding 
index structure accesses, here we do not focus on this application. Recent work 
in privacy-preserving index structures includes [4].

In [1,2], we proposed a protocol for hiding traversals of trees from oracles.
Noting that existing private information retrieval techniques require either
heavy replication of the database onto multiple non-communicating servers or
large communication or computation costs [13], we provided a one-server
tree-traversal protocol that balances communication/computation cost against
security requirements. To protect the client from the malicious data store, some
tasks (such as traversing the tree structures) are performed interactively.
Client responsibilities include encryption and decryption of the data received
from the data store during the traversal of the tree. The encryption capability
required at the client side can be provided by auxiliary hardware, such as
smartcards, which are cheap (generally no more than several dollars) and now
commonly used in mobile environments [20]. We analyzed the overhead incurred by
the proposed technique, including communication cost, encryption/decryption
costs, and concurrency overhead. Since [13] has argued that
information-theoretic private information retrieval cannot be achieved without
significant communication overhead, our proposed method minimizes those costs in
a computational hiding sense.

In this paper, we build on the approaches proposed in [1,2] to develop a 
computationally secure protocol for hiding correlated accesses to tree-structured 
outsourced data. We find that if node accesses are uniformly distributed, the 
original protocol achieves computational privacy [2]. To ensure computational 
privacy in the face of non-uniformly distributed and correlated node accesses, which
actually occur in real scenarios, we propose a systematic way to enhance the 
preliminary protocol so that from the server’s view, node accesses are uniformly 
distributed. 

2 Related Work 

Besides the common data encryption methods that hide sensitive data, there are
various efforts to hide other kinds of secret information from untrusted servers.
Basic methods to protect content privacy include database encryption, where
critical data such as credit card numbers can be encrypted. DBMS products such
as Oracle and DB2 provide encryption functionality. Bertino and Ferrari
[10] have studied how to protect sensitive XML data content from different
entities by performing differentiated encryption of various portions using
multiple keys. Hacıgümüş et al. [7] have studied how to execute SQL over
encrypted data. Other recent work on querying encrypted data includes [5] and
[6].

Sensitive information about data may include users’ queries about the data.
Unlike traditional database security, which deals with preventing, detecting,
and deterring improper disclosure or modification of information in databases, 
private information retrieval (PIR) aims to let users query a database without 
leaking to the database what data is queried. 

The basic idea behind any information-theoretic PIR scheme is to replicate
the database to several non-communicating servers and pose randomized queries
to each server, so that from each server’s view the queries are independent of
the target, yet the user can reconstruct the target data from the query results.
The privacy guarantee is that even a computationally unbounded server cannot
tell the difference between any two communications for different targets. [13]
showed that if only one copy of the database is used, the only way to hide the
query in the information-theoretic sense is to send the whole database to the
user. To save even one bit of communication between the server and the user,
replication of the whole database is required. Hence information-theoretic
privacy techniques require multiple data copies and cause heavy communication
overheads.
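
To make the replication idea concrete, the following minimal sketch shows a textbook two-server scheme over a bit database. It is not one of the schemes cited in this paper, and the function names are hypothetical. The client sends a uniformly random index subset to one replica and the same subset with the target's membership flipped to the other; each subset on its own is independent of the target, and XOR-ing the two one-bit answers recovers the target bit. The query itself is still linear in the database size, which is the kind of overhead the text refers to.

```python
import secrets

def make_queries(n, target):
    """Return two index subsets that differ only in whether they contain `target`."""
    s1 = {i for i in range(n) if secrets.randbits(1)}  # uniformly random subset
    s2 = s1 ^ {target}  # symmetric difference flips the membership of `target`
    return s1, s2

def server_answer(db_bits, subset):
    """What one replica computes: the XOR of the bits at the requested positions."""
    answer = 0
    for i in subset:
        answer ^= db_bits[i]
    return answer

def retrieve(db_bits, target):
    """Client side: query two (assumed non-communicating) replicas and combine."""
    s1, s2 = make_queries(len(db_bits), target)
    # In a deployment the two calls below would go to two different servers.
    return server_answer(db_bits, s1) ^ server_answer(db_bits, s2)
```

All positions other than the target appear in both subsets or in neither, so their contributions cancel in the final XOR.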

For practical use, the communication cost and the number of replicas need to be
reduced. Ambainis proposed a $k$-server scheme that requires
$O(n^{1/(2k-1)})$ communication [14] (where $n$ is the database size). This
result was further improved to $n^{O(\log\log k / (k \log k))}$ by Beimel et
al. [15], which is the best known $k$-server information-theoretic private
information retrieval scheme so far. However, to achieve communication that is
subpolynomial in the size of the database, more than a constant number of
servers are needed [13]. In the real world, database replication is not a
preferable solution, and it may not be possible to prevent servers from
communicating with each other. In computationally private information retrieval
(CPIR) schemes, therefore, the user’s privacy requirement is relaxed so that any
two communications for different targets are indistinguishable to any
polynomial-time server.

CPIR schemes are built on cryptographic assumptions, which enable further
reduction of the communication and the number of replicas. If a one-way
function exists, then there is a 2-server scheme such that the communication
is $O(n^{\epsilon})$ for any $\epsilon > 0$ [16]. Under the Quadratic
Residuosity Assumption, a one-server scheme can be constructed with
sub-polynomial communication [17]. Under the $\Phi$-hiding Assumption, a
one-server scheme with poly-logarithmic communication can be achieved [18].
Based on the Paillier cryptosystem, which is secure under the Composite
Residuosity Assumption, Chang [19] proposed a one-server scheme with logarithmic
communication overhead, which is optimal for CPIR. Despite the reduced
communication overhead, however, all of the above CPIR schemes [16-19] suffer
from heavy computation cost (linear in the database size) at both the client and
server sides. Smith et al. [12] employed a secure processor to achieve
computational privacy: a secure processor is embedded at the malicious server,
clients encrypt their queries, and the secure processor decrypts the queries
and reads the database to retrieve the targets. Since the whole database has to
be read into the secure processor to hide the query from the server, a cost
linear in the database size is still incurred at the server side.

Most existing information-theoretic PIR and computational PIR schemes are built
on a binary data model, which makes them hard to apply to real databases. In
the next section, we provide an overview of our previous work, which enables a
one-server CPIR scheme that can be easily applied to real data models and
involves only moderate and adjustable computation and communication to gain
content privacy and access privacy [1,2].



3 Background: Private Tree Traversal
with Access Redundancy and Node Swapping

Fig. 1. (a,b,c) Leakage of the position of the root node of the tree structure
as a result of repeated accesses and (d,e,f) node swapping eliminates leakages

In [1], we proposed a preliminary protocol to hide traversal paths on
tree-structured data. The protocol is a one-server protocol. Content privacy is
guaranteed through encryption of the tree nodes (data and pointers) before
outsourcing, so the host cannot learn the data content. Access privacy is
achieved by novel access redundancy and node swapping techniques. The data
storage space is divided into units of nodes (called physical nodes), which may
contain tree nodes (called data nodes) or have no content (empty nodes), and is
organized into a multi-level structure in which each level stores the
corresponding level of the tree. Whenever a client wants to retrieve a tree
node, it asks for a set of random nodes from the level in addition to the target
node. We define the set of nodes the client retrieves in order to get the target
as the redundancy set. Hence, if the size of the redundancy set is $m$, the
probability for the server to correctly guess the target is $1/m$; the
probability for the server to correctly guess a parent-child relationship is
$1/m^2$; and if the depth of the tree is $l$, the probability for the server to
find the traversal path is $1/m^l$. The definition of the redundancy set can be
extended to contain multiple target nodes. If there are $k$ targets in the
redundancy set and $C_m^k$ denotes the number of ways to choose $k$ nodes out of
$m$, the probability for the server to correctly guess the targets is
$1/C_m^k$; the probability for the server to correctly guess the parent-child
relationships is $1/((C_m^k)^2 \times k!)$; and the probability for the server
to correctly guess the traversal paths is $1/((C_m^k)^l \times (k!)^{l-1})$.
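
As a quick numerical illustration of these guessing probabilities, the sketch below evaluates them exactly. It assumes that the server guesses uniformly at random, that each level's target set is one of the $C_m^k$ possible $k$-subsets, and that parents and children are matched one-to-one ($k!$ possible matchings between consecutive levels). The function name `guess_probabilities` is hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def guess_probabilities(m, depth, k=1):
    """Chance that a uniformly guessing server is right, for redundancy sets of
    size m containing k targets, on a tree with `depth` levels.

    Returns (targets, parent_child, traversal_paths) as exact fractions.
    """
    subsets = comb(m, k)  # candidate target sets per call
    targets = Fraction(1, subsets)
    # Two adjacent levels guessed correctly, plus the right parent/child matching.
    parent_child = Fraction(1, subsets**2 * factorial(k))
    # All levels guessed correctly, plus a matching between each adjacent pair.
    traversal_paths = Fraction(1, subsets**depth * factorial(k) ** (depth - 1))
    return targets, parent_child, traversal_paths
```

For example, with `m = 4`, `depth = 3` and a single target, the three values are 1/4, 1/16 and 1/64, so the path-guessing probability falls geometrically with the tree depth.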

Unfortunately, we find that repeated accesses to the same target (e.g., the
root) may reveal the target, because the target is always in the intersection of
the related redundancy sets. Figures 1(a), (b), (c) give an example showing how
repeated accesses to the root may reveal the physical location of the root. In
Figure 1, large circles represent redundancy sets and small circles represent
the root. In [1,2] we addressed this problem by requiring that each redundancy
set include at least one randomly selected empty node. After each retrieval the
client swaps the target with the empty node, re-encrypts the redundancy set
using a different key or encryption scheme (which is essential in order to hide
the locations of the target and the empty node), and then writes the set back
into the data storage space at the server. This is called node swapping. Node
swapping ensures that after each retrieval the target moves to a random position
in the data store, which makes the distribution of data in the data storage
space random, i.e., data keep moving randomly as queries are posed and answered.
With node swapping, any correct guess of the target is transient, and hence the
information leaked by intersections is reduced (Figures 1(d), (e), (f)).
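
The retrieval step with access redundancy and node swapping can be sketched as follows for a single level. This is an illustrative simplification, not the implementation from [1,2]: one level is modelled as a Python list whose `None` entries are empty nodes, encryption and re-encryption are omitted, and the name `retrieve_with_swap` and its parameters are hypothetical.

```python
import random

def retrieve_with_swap(level, target_pos, m, rng=random):
    """Fetch the node at `target_pos` through a redundancy set of size m,
    then move it into a randomly chosen empty slot.

    Returns (value, redundancy_set, new_position_of_target).
    """
    empties = [i for i, v in enumerate(level) if v is None]
    empty_pos = rng.choice(empties)  # at least one empty node per set
    others = [i for i in range(len(level)) if i not in (target_pos, empty_pos)]
    redundancy_set = [target_pos, empty_pos] + rng.sample(others, m - 2)
    rng.shuffle(redundancy_set)  # request order must not reveal the target
    value = level[target_pos]
    # Swap target and empty node. In the real protocol all m nodes would be
    # re-encrypted here before being written back to the server.
    level[empty_pos], level[target_pos] = value, None
    return value, redundancy_set, empty_pos
```

Because the target's new position was part of the previous redundancy set, a second call for the same target necessarily intersects the first one, which is the residual leakage that Section 4 addresses.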

Fig. 2. Intersections may leak identical queries

Since data nodes are constantly swapped, the parent-child links have to be
properly maintained. The physical location of the root is maintained in a fixed,
encrypted special node snode [1,2], which is the entry to the data store. In
[1,2], we developed highly concurrent techniques to maintain parent-child links
as child nodes are swapped. Based on this, we have implemented a deadlock-free
private tree traversal algorithm [1,2] that enables a client to query the tree
by traversing it and locating the required data node. The total cost (including
communication cost, read/write cost and encryption/decryption cost) of the
protocol is a function of the redundancy size $m$, the total number of data
points $num$, and the node size $s$, i.e., the maximum number of data points a
data node can contain. If $c$, $e$/$d$, and $r$/$w$ denote the communication,
encryption/decryption, and read/write cost functions with respect to the node
size $s$, the total cost is
$\log_s(num) \times m \times (c(s) + e(s) + d(s) + r(s) + w(s))$. This cost is
adjustable.
Compared with the costs of existing one-server PIR schemes, which are linear in
the database size $num$, this cost is moderate (poly-logarithmic in $num$). The
cost function also shows that, with $m$ and $num$ fixed, there exists an optimal
node size $s$ that minimizes the total cost. Experimental results [1,2] show
that the performance of the protocol is consistent with the theoretical analysis
above and that, compared with one-server information-theoretic private
information retrieval schemes, this cost is small; hence the proposed approach
is more practical.
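
The existence of an optimal node size can be illustrated with a small numerical sketch. It rests on assumptions that are not taken from the paper: the number of calls per query is taken to be the tree depth, approximated by the logarithm of `num` to base `s`, and each of the five per-node cost functions is assumed to be the same affine function of `s` (a fixed overhead plus a per-data-point cost). The constants and function names are hypothetical.

```python
import math

def total_cost(s, m, num, per_point_cost=1.0, per_node_overhead=20.0):
    """Assumed cost model: depth * m * (c(s) + e(s) + d(s) + r(s) + w(s))."""
    levels = math.log(num, s)  # approximate tree depth for node size s
    # Five cost terms (c, e, d, r, w), each assumed affine in the node size.
    per_node = 5 * (per_node_overhead + per_point_cost * s)
    return levels * m * per_node

def best_node_size(m, num, candidates=range(2, 513)):
    """Node size among `candidates` that minimizes the assumed total cost."""
    return min(candidates, key=lambda s: total_cost(s, m, num))
```

Under this model, small nodes make the tree deep (many calls) while large nodes make every call expensive, so the minimum lies strictly between the extremes, and changing `m` rescales the cost without moving the optimum.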

Although this algorithm hides uniformly distributed tree node accesses, when 
queries are correlated, the server can learn private information about queries and 
data. In this paper, we present techniques that guarantee computational privacy
even for non-uniformly distributed accesses that are common in real databases. 



4 Problem Formulation: Preventing Information Leakages
Caused by Intersections of Redundancy Sets 

In this paper, we refer to a client’s retrieval of a single redundancy set as a
call. A query can then be represented as an ordered set of calls. For instance, a
path from the root to a leaf is a sequence of calls from the client to the
server.

Supposing that there is a transport layer security mechanism (e.g., an anonymous
access protocol) that hides the identity of the client, the server sees data
accesses as a stream of calls from unknown origins. We define the stream of
calls the server has observed during a certain period of time as a view.

Computational privacy requires that no computationally bounded server be able
to tell the difference between client-server communications for two different
queries. Given the above query model, we note that the server might be able to
infer information by observing the call stream, in particular the intersections
of the redundancy sets of the corresponding calls in each query. For instance,
if two queries, Query $A$ and Query $B$, are the same and consecutive, i.e.,
there are no interfering queries between them, their calls for every node on the
path will intersect, which hints to the server that two identical queries have
happened. Figure 2 depicts this phenomenon, with $A_i$ and $B_i$ denoting the
corresponding calls of $A$ and $B$, respectively.

Correct identification of identical queries not only increases the risk of
leaking the traversal path (although such leakage is transient), it also leaks
how frequently queries are posed. If the server happens to know the query
distribution in advance, i.e., how often every query occurs, and if query
frequencies differ markedly, the server can identify the queries that have been
posed.

Our goal in this paper is, therefore, to provide computational privacy in the
presence of correlated queries:

1. Hiding distribution of calls: for any two different queries $Q_1$ and $Q_2$
posed in the view, the distributions of their sequences of calls are
indistinguishable in polynomial time.

2. Hiding intersections: for any two queries $Q_1$ and $Q_2$ in the view, it is
hard to tell whether they are identical by observing their sequences of calls.



4.1 Privacy Guarantees for Uniformly Distributed Node Accesses 

In [2], we showed that if node accesses are uniformly distributed at every level
of the tree (i.e., for every level, the tree nodes on this level are accessed by
clients with the same probability), and if the database is randomly initialized,
then the protocol already achieves the required computational privacy.

Hiding Distribution of Calls. Our proof is based on the following proposition 
and corollary. 

Proposition 1. If the data storage space is randomly initialized and data nodes 
are uniformly accessed in each level of a tree, then data nodes are always uni- 
formly distributed in each level of the data storage space. 

Corollary 1. If the data store is randomly initialized and node accesses are 
uniformly distributed for every level of a tree, then redundancy sets posed are 
also uniformly distributed for any level of the data store. 

If the data storage space is randomly initialized and node accesses are uniformly 
distributed in every level, then data nodes will always be uniformly distributed in 
each level of the data storage space and calls for that level will also be uniformly 
distributed. So for two different queries, if their traversal path lengths are equal, 
the distributions of their sequences of calls are identical, hence indistinguishable
in polynomial-time; if their traversal path lengths are not equal, clients can 
execute dummy calls at deeper levels to always make the same number of calls. 
Details of the proofs are omitted here due to space constraints. 

Hiding Intersections. As to the second privacy requirement, the only hint the
server has for identifying identical queries from views is the intersections
between calls: if two identical queries are posed consecutively without any
interfering calls, their corresponding calls will intersect. However, if node
accesses are uniformly distributed, then even if the server happens to know that
at some time $t$ there is a call belonging to $Q_1$ such that one of its targets
is the data node $nd$, and the server observes that some time later there is a
call belonging to $Q_2$ that intersects with $Q_1$’s call at $nd$, it cannot
judge whether the later call is also targeting $nd$, and hence has no hint as to
whether $Q_2$ is identical to $Q_1$. If $m$ denotes the size of the redundancy
sets, $n$ denotes the number of data nodes in the tree level where $nd$ is
located, $N$ denotes the total number of nodes in the data storage space of the
level, and $k$ denotes the number of target nodes per call, then

- the call of $Q_2$ may intersect with the call of $Q_1$ at $nd$ because it also
sets a target on $nd$. This probability is $P_1 = C_{n-1}^{k-1} / C_n^k$.

- the call of $Q_2$ may intersect with the call of $Q_1$ at $nd$ because it
selects $nd$ as one of its random nodes. This probability is
$P_2 = \frac{C_{n-1}^{k}}{C_n^k} \times \frac{C_{N-2k-1}^{m-2k-1}}{C_{N-2k}^{m-2k}}$.

If there are enough empty nodes in the level to expand the data storage space,
e.g., $n$ empty nodes, and if $m$ is large enough, e.g., at least $4k$, $P_2$
will be no less than $P_1$. Hence, if the data storage space for the tree is
expanded linearly and calls contain enough random nodes (only linear
redundancy), the server cannot tell from the intersections whether two queries
are identical or their corresponding redundancy sets just happen to intersect,
and the probability for the polynomial-time server to judge correctly whether a
view contains identical queries cannot be non-negligibly better than random
guessing.
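
The comparison of the two intersection probabilities can be checked numerically. The sketch below assumes the binomial forms $P_1 = C_{n-1}^{k-1}/C_n^k$ and $P_2 = (C_{n-1}^{k}/C_n^k) \times (C_{N-2k-1}^{m-2k-1}/C_{N-2k}^{m-2k})$, which simplify to $k/n$ and $\frac{n-k}{n} \cdot \frac{m-2k}{N-2k}$; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def p_same_target(n, k):
    """P1: a later call intersects at nd because nd is again one of its k targets."""
    return Fraction(comb(n - 1, k - 1), comb(n, k))

def p_random_hit(n, N, m, k):
    """P2: nd is not a target of the later call but is picked as a random node."""
    not_a_target = Fraction(comb(n - 1, k), comb(n, k))
    picked_at_random = Fraction(comb(N - 2 * k - 1, m - 2 * k - 1),
                                comb(N - 2 * k, m - 2 * k))
    return not_a_target * picked_at_random
```

With `N = 2n` (that is, $n$ empty nodes) the second value reduces to $(m-2k)/(2n)$, so it equals $k/n$ exactly at `m = 4k` and exceeds it for larger redundancy sets.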

4.2 Naive Approaches for Providing Privacy Guarantees 
for Non-uniformly Distributed Node Accesses 

In real situations, queries and node accesses are not always uniformly
distributed. If node access frequencies vary widely, frequent intersections are
more likely a sign of repeated occurrences of high-frequency queries than of
random intersections. Thus intersections may leak information about the
occurrence of identical high-frequency queries, enabling the server to deduce
query frequencies. If the server knows in advance the tree structure and how
often each tree node is accessed, such information leakage increases the risk of
leaking the queries that have been posed.

These attacks by the server would rely on the intersection property mentioned 
above. Intuitively, such an attack by the server can be prevented by ensuring 
that intersections do not reveal much information. This can be achieved by 
modifying the client/server protocols such that the redundancy sets intersect at
multiple nodes as well as by inserting appropriate dummy/interfering calls. 

Intuitively, dummy interfering calls add ambiguity and reduce the probability
with which the server can identify the calls that correspond to identical
queries. Note that in order to provide efficient and provable security, the
process of introducing dummy calls has to follow a strategy that randomizes the
intersections with minimal overhead. In this section, we first discuss three
naive approaches to minimizing information leakage by intersections. In the next
section, we present a systematic and efficient way to choose proper dummy
interfering calls.

Fig. 3. Naive uniform approach: (a) original node access distribution and (b)
adding dummy node accesses enables uniform distribution

Fig. 4. Replication: (a) original tree and node access frequencies, (b) and (c)
two different ways of replicating nodes

Naive Uniform Approach: A straightforward approach to making tree nodes
uniformly accessed is to generate enough dummy accesses for infrequently
accessed tree nodes so that all tree nodes are accessed at the same frequency as
the most often accessed one. Figure 3 depicts the approach. In Figure 3, the X
axis denotes the node access rank (the node ranked 1 is accessed most often) and
the Y axis denotes the number of accesses for each tree node during a unit time
interval. Figure 3(a) shows the original node access distribution and Figure
3(b) shows the uniform distribution after enough dummy accesses (depicted by
grey columns) are generated. The cost of this approach is
$\sum_i (f_{max} - f_i)$, where $f_{max}$ is the maximum access frequency and
$f_i$ is the access frequency of the node ranked $i$. In general, this naive
approach leads to a large number of dummy accesses (for example, it would require
exponentially many dummy calls for trees where the root is accessed once for each
leaf), which would impose a heavy cost on the system.
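
The cost of the naive uniform approach can be made concrete with a short sketch. It evaluates the sum of $f_{max} - f_i$ over all nodes for a hypothetical workload in which every leaf of a full binary tree is queried once through a root-to-leaf traversal; the workload and the function names are illustrative assumptions.

```python
def dummy_accesses_needed(freqs):
    """Dummy accesses required to lift every node to the maximum frequency."""
    f_max = max(freqs)
    return sum(f_max - f for f in freqs)

def full_binary_tree_freqs(depth):
    """Per-node access counts when each leaf of a full binary tree with `depth`
    levels is reached once by a root-to-leaf traversal."""
    leaves = 2 ** (depth - 1)
    freqs = []
    for level in range(depth):
        nodes = 2 ** level
        # Each node on this level lies on the paths to leaves // nodes leaves.
        freqs += [leaves // nodes] * nodes
    return freqs
```

For `depth = 3` the 12 real accesses need 16 dummy accesses, and for `depth = 4` the 32 real accesses need 88, so the overhead grows much faster than the real workload.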

Replication Approach: The idea of this naive approach is to replicate nodes
that are accessed with higher frequencies into multiple copies so that each copy
is accessed at the same frequency as the lowest-frequency node. There are two
ways to replicate nodes. The first is to make node accesses uniformly
distributed at every level by applying replication repeatedly from top to
bottom. Figure 4(b) gives an example of how to replicate the tree shown in
Figure 4(a) so that nodes are uniformly accessed at every level. In this figure,
the number associated with a tree node denotes its access frequency. Note that
with this approach, the nodes of different levels have different access
frequencies; therefore level information is leaked. The second approach prevents
leakage of level information by applying replication to all nodes of the tree,
so that they are all accessed at the same frequency as the least accessed tree
node. Figure 4(c) depicts the tree structure after applying the second
replication approach to the tree in Figure 4(a). Replication is simple and does
not need dummy node accesses, hence it is query efficient. However, since every
high-frequency node is replicated according to the lowest frequency, huge disk
space is required to replicate the tree in both of these replication approaches.
For the first approach, space exponential in the height of a node is required to
get just one extra copy of this single node. The second replication approach
needs less replication (exponential in the tree depth). However, since every copy
of the parent must maintain the addresses of its children, when a child is
swapped, all copies of its parent have to be updated to refer to the new address
of the child. This increases the access frequency of the copies of the parent,
making the unified access non-uniform. Furthermore, for both approaches, if the
content of a node is updated, all copies of the node must also be updated, which
makes updates costly and disturbs the uniform accesses of nodes. This problem is
inherent in replication.
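
The storage overhead of the second replication approach can be estimated with a few lines. The sketch assumes that a node with access frequency $f$ needs $\lceil f / f_{min} \rceil$ copies so that each copy is accessed about as often as the least accessed node; the node labels and frequencies in the usage note are hypothetical and are not the values of Figure 4.

```python
import math

def copies_needed(freqs):
    """Copies per node so that each copy is accessed about f_min times."""
    f_min = min(freqs.values())
    return {node: math.ceil(f / f_min) for node, f in freqs.items()}

def storage_blowup(freqs):
    """Ratio of replicated nodes to original nodes."""
    return sum(copies_needed(freqs).values()) / len(freqs)
```

For a three-level tree with frequencies `{"root": 8, "a": 4, "b": 4, "c": 2, "d": 2, "e": 2, "f": 2}`, the root needs 4 copies and each inner node 2, so 7 original nodes become 12 stored nodes.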

Clique Approach: Another naive approach to minimize information leaked by 
intersections is to generate calls for queries such that any two queries’ corre- 
sponding calls intersect. If we use a graph to represent a view so that every ver- 
tex depicts a query and every edge depicts the intersection between two queries, 
with this approach, the view forms a clique. The idea behind this approach is to 
make the probability for non-identical queries to intersect equal the probability 
for identical queries to intersect (both equal 1) so that there is no way for the 
server to find identical queries by observing intersections. This approach is also 
too costly. It requires a large redundancy set, for a call should contain more than 
half of the data storage nodes from its level to be able to intersect with corre- 
sponding calls of all other queries. Actually this cost is comparable to sending 
the whole database to the client. 

5 Proposed Approach:
Clustering Node Accesses into Uniform Chains

From the naive approaches described above, we observe that challenges associ- 
ated with hiding information that may be leaked by intersections are (a) making 
node accesses seem uniform, (b) without introducing many dummy node ac- 
cesses, (c) without large redundancy set sizes, and (d) without requiring large 
storage space. In this section, we propose a frequency clustering and chain- 
merging approach to minimize information leaked by intersections. 

Let D denote the total number of tree nodes. The idea behind the approach
is as follows: if we can fractionally cluster D nodes into D equivalent classes
such that the total access frequency for each equivalent class is almost equal to
the average access frequency, and if we can merge node accesses in each
equivalent class together, the server’s view of node accesses will become uniformly
distributed. This is depicted in Figure 5. Accesses to high frequency nodes are
split (Figure 5(a)). For example, the node ranked first is split into segments 1-1,
1-2, 1-3, 1-4, 1-5. Then, as shown in Figure 5(b), accesses to high frequency nodes
are clustered with accesses to low frequency nodes. For example, segment 1-5
is clustered with 7 to form cluster $c_7$, and segments 1-2, 2-2, 3-2 are clustered
with 4 to form cluster $c_4$. The total frequency for each cluster adds up to
$f_{avg}$. The splitting and clustering depicted in Figure 5 illustrate the process
of fractional clustering of node accesses. For example, the cluster $c_7$ in
Figure 5(b) depicts that with some clustering probability (the ratio between the
frequency of the segment 1-5 and the total frequency of the first ranked node
access), the first ranked node access is clustered with the seventh ranked node
access.

Fig. 5. Frequency clustering: (a) Splitting high frequency accesses and (b) access
frequencies are uniformly distributed after merging

Fig. 6. Chain merging: (a) Two access chains for nodes A and B, respectively (each
edge denotes an intersection in redundancy sets as described in Section 4); (b) A’s
access chain and B’s access chain are merged together into one single chain (each
consecutive pair of calls has two intersections)

The challenge, however, is to make accesses for different nodes in each cluster
look the same. Figure 6(a) shows two different chains formed by accesses of
nodes A and B that are clustered together. The shape of the individual access
chains can leak information about access frequency to the server. Unless the node
accesses in each cluster resemble each other, the server will observe individual
accesses. To address this challenge, we propose chain merging (Figure 6). By
merging chains, we mean that whenever a node in a given cluster is accessed by
a client, the client also accesses the other nodes in the cluster together with
this node in a single redundancy set. Figure 6(b) shows the result when accesses
of A and B are merged: the access chains of A and B are merged into one single
chain, and the resulting chain looks uniform.

[Figure 7: cluster access chains drawn over a time axis t1, ..., t9.]

Fig. 7. Crossings among chains are denoted with dashed arrows

With this approach, the malicious server will observe access chains, each of
which occurs with approximately $f_{avg}$ frequency. Hence access frequency
distributions will not be leaked by intersections (except the average access frequency).
To achieve chain merging, however, there are three further challenges that have 
to be addressed: 

— Maximum cluster size: The maximum number of node accesses per cluster
should be bounded by a subpolynomial function of the database size; i.e., we
should avoid clusters that require redundancy sets whose sizes are polynomial
in the database size. Otherwise, merging them into one would require large
redundancy, which would cause heavy communication overheads. This is discussed
in Section 5.1.

— Cluster directory size and access frequency: We need a storage and search 
efficient directory structure to maintain the clustering information so that 
whenever a client needs to access a target node, it can quickly find the 
corresponding clusters and identify all nodes in those clusters. This directory 
will be maintained at the server in encrypted form. Since each node access
is preceded by a directory lookup, the size of the directory structure as well
as its access pattern should not leak further information. This is discussed
in Section 5.2.

— Chain crossings: A node may belong to more than one cluster, making multiple
chains cross with each other. Figure 7 gives an example. In this example,
there are three clusters: $c_1 = \{A, B\}$, $c_2 = \{A, C\}$, $c_3 = \{D\}$. The axis
depicts the time line, and the small circle above time $t_i$ represents the cluster
that is accessed at time $t_i$. Since A is in clusters $c_1$ and $c_2$, we can see
that besides the intersections which form a chain for each cluster (depicted
by solid arrows), there are some crossing intersections (depicted by dashed
arrows) between chains $c_1$ and $c_2$. However, since $c_3$ shares no elements with
$c_1$ and $c_2$, there is no crossing associated with $c_3$. Such crossings between
chains should have a uniform pattern; otherwise a powerful server may infer
extra information from their distribution. For this example, the server may
deduce that cluster $c_1$ and cluster $c_2$ share some high frequency element.
This is discussed in Section 5.3.

Fig. 8. Node access distribution example: a balanced tree with uniform leaf accesses

In the rest of the paper, we address each of these challenges. 

5.1 Minimizing the Maximum Cluster Size 

In this section, we show how to restrict the maximum number of accesses per 
cluster to 3 (except for boundary clusters) . The procedure starts from the highest 
and lowest frequency nodes and progressively moves towards medium frequency 
nodes. We first split the highest frequency node access into segments of enough 
volume to fill the gap between the lowest frequency and the desired average 
frequency. Then, we use the highest frequency node to fill as many of the lowest 
frequency node accesses as possible. In most cases this scheme will lead to clusters 
of size 2. In some cases, the segment of the highest frequency node
access cannot fill the gap between the low frequency and the average frequency
and needs to be clustered with a segment from the next highest frequency, resulting
in a cluster of size 3. In this way, we can continue using segments of the highest
available frequencies to fill the unfilled lowest frequencies, getting clusters of size 
2 or 3. 

An extreme exception may happen at the final stage of the above process
when the highest available frequencies are just above $f_{avg}$, the unfilled
lowest frequency is below $f_{avg}$, and the ranks of the highest available
frequencies are immediately before the rank of the unfilled lowest frequency.
Let $i$ denote the rank of the final unfilled lowest frequency node. Because the
available highest frequencies $f_{i-j}, f_{i-j+1}, \ldots, f_{i-1}$ are just above
$f_{avg}$, and $f_i$ is well below $f_{avg}$, a number of node accesses (ranked
$i-j, i-j+1, \ldots, i-1$) are needed to fill the gap between $f_i$ and $f_{avg}$.
In the case of tree accesses (shown in Figure 8) we can show that the number of
nodes in the extreme cluster is bounded by the depth, $d$, of the tree. Suppose
there are $d$ distinct access frequencies (one for each level) and there are
$2^{l-1}$ different $l$-th ranked node accesses, each with access frequency
$2^{d-l}$. This extreme case happens if $f_{i-1}$ is just above $\frac{d}{2}$, i.e.,
$f_{i-1} = \frac{d}{2} + 1$ if $d$ is even, or $f_{i-1} = \frac{d+1}{2}$ if $d$ is odd.
In both cases, the final boundary clusters have a size of $O(d)$. In conclusion,
if $D$ denotes the total number of nodes, this way of clustering generates $O(D)$
clusters of size 2, at most $d$ boundary clusters of size 3, and final boundary
clusters of size $O(d)$. The advantage of this is that the number of chains that
have to be merged and the required cluster size are both small.
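
The splitting-and-filling procedure of this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fractional_clusters`, the list-based input and the use of exact fractions are assumptions made for illustration. Each node below the average keeps its whole frequency and is topped up with segments donated by the highest-frequency nodes that still have excess above the average.

```python
from fractions import Fraction

def fractional_clusters(freqs):
    """Greedy sketch: one cluster per node, each a dict {node: frequency share}."""
    d = len(freqs)
    f_avg = Fraction(sum(freqs), d)
    # Node indices ordered from highest to lowest access frequency.
    order = sorted(range(d), key=lambda n: freqs[n], reverse=True)
    # Share that a node can donate to other clusters (negative for low nodes).
    excess = {n: Fraction(freqs[n]) - f_avg for n in order}
    clusters = {}
    hi = 0  # position of the current donor in `order`
    for n in reversed(order):  # start from the lowest frequency node
        share = min(Fraction(freqs[n]), f_avg)
        cluster = {n: share}
        gap = f_avg - share
        while gap > 0:
            while excess[order[hi]] <= 0:
                hi += 1  # current donor is exhausted, move to the next highest
            donor = order[hi]
            segment = min(gap, excess[donor])
            cluster[donor] = cluster.get(donor, 0) + segment
            excess[donor] -= segment
            gap -= segment
        clusters[n] = cluster
    return clusters, f_avg

# Hypothetical access frequencies for five nodes.
clusters, f_avg = fractional_clusters([8, 4, 2, 1, 1])
```

In this toy input every cluster totals the average frequency 16/5, and the largest cluster contains three nodes, which matches the cluster sizes of 2 or 3 discussed above.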

5.2 Implementing a Storage and Search Efficient Cluster Directory 

Directory information of clusters is available in two places: (a) a cluster table 
C maintains which nodes are in a given cluster and what their frequency shares 
are; and (b) for each child node in the tree, its parent keeps the list of clusters 
the child belongs to and the corresponding clustering probabilities (determined 
by the ratio of the child’s frequency shares and its total access frequency) for 
the child. The cluster table C is encrypted and stored in a fixed address in the 
server space (this is called snode in [1, 2]). Entry $i$ of the cluster table C records
the identifiers of all nodes whose accesses are clustered into cluster $c_i$. The
tree structure is also encrypted (Section 3).

While traversing from a parent node to a child node, a client first identifies 
which clusters the child node belongs to and picks one of those clusters using 
the associated clustering probabilities. Figure 9 provides an example. In this 
figure, the node A is fractionally clustered into 4 different clusters: {A, B},
{A, C}, {A, D}, and {A, E}. Each time A is accessed from its parent, one of
these 4 clusters will be chosen based on the clustering probabilities stored in the
pointer. Therefore, the resulting cluster access pattern can be represented as a
random walk graph shown in Figure 9(a), whose vertices (i.e., clusters) have an equal
number of visits (i.e., uniform access distribution). The weights associated with
the arrows are traversal probabilities for A that are calculated from A’s clustering
probabilities. For example, $p_{12}$ denotes the probability with which the next
access of A is clustered into $c_2$ if the current access of A is clustered into $c_1$.
Once the cluster to be used is identified, the client uses the cluster table C to 
find other nodes in the chosen cluster and requests all the nodes in the cluster in 
a single redundancy set. Since the cluster table and pointers are encrypted, the 
only information the server observes are the sizes of the entries in the directory 
and their access frequencies. We can hide entry sizes from the server by extending 
all of them to the maximum cluster size. Since chains are uniformly accessed, 
entries of C are uniformly accessed, giving no extra information per access except 
(possibly) the random identifier of the cluster for which we do not care. 
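
The lookup described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the protocol itself: the table contents, the probabilities, and the names `cluster_table`, `pointer_to_a`, `pad` and `access` are hypothetical, and encryption is left out entirely.

```python
import random

# Hypothetical cluster table C: cluster identifier -> identifiers of its nodes.
cluster_table = {
    "c1": ["A", "B"],
    "c2": ["A", "C"],
    "c3": ["A", "D"],
    "c4": ["A", "E"],
}

# Hypothetical pointer kept by A's parent: clusters of A with their
# clustering probabilities (frequency share / total access frequency of A).
pointer_to_a = {"c1": 0.4, "c2": 0.3, "c3": 0.2, "c4": 0.1}

def pad(entry, max_size, dummy="-"):
    """Extend a table entry to the maximum cluster size to hide its real size."""
    return entry + [dummy] * (max_size - len(entry))

def access(pointer, table, rng):
    """Pick one cluster by its clustering probability and return all its nodes."""
    ids = list(pointer)
    chosen = rng.choices(ids, weights=[pointer[c] for c in ids], k=1)[0]
    # All nodes of the chosen cluster are requested in one redundancy set.
    return chosen, list(table[chosen])

chosen, nodes = access(pointer_to_a, cluster_table, random.Random(0))
```

Whatever cluster is drawn, the requested set contains A together with its cluster partner, so the server sees one request per cluster entry rather than one per node.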

The cluster table search cost is $O(1)$: the client only needs to access one
entry of C per access and each entry has a constant size (as discussed above).
The storage cost of the cluster table is proportional to the size of the database. 
Since the cluster table maintains pure node identifiers and generally the size of 
a node is much greater than the size of a node identifier, the cluster table is 
small compared to the size of the database. However, unless properly designed, 
the storage cost for pointers can be prohibitively expensive: each parent stores 
all the clusters each child belongs to and, since (in the worst case) a node can 
be clustered into D (the size of the database) many clusters, the pointers could 
become prohibitively large. 
Fig. 9. Reducing pointer costs: (a) In the original structure, the pointer has to maintain
all clusters of node A; (b) in the new structure, the pointer only maintains the next 
cluster 



To prevent this, instead of storing all the clusters of a given child, the parent 
should store only a small (constant) number of clusters of the child. This would 
ensure that the storage requirement is small; however since not all clusters are 
available, the clustering probabilities for the child would be altered, destroy- 
ing the uniform access distribution of the cluster chains. In order to prevent 
the access distribution of the chains diverging from uniform, we need to read- 
just traversal probabilities and modify the cluster table to let it contain more 
information. 

Figure 9(b) shows how to reduce the storage cost for the pointer from $P_A$ to A.
In this example, we first reduce the number of cluster neighbors from 3 to 2. To
achieve this, we need to recompute the random walk traversal probabilities. The
probabilities associated with the new random walk graph are computed so that
the clustering probabilities of A remain the same and the access frequency of
each cluster is kept uniform. The cluster table C is modified so that each entry
of a cluster reflects the two possible next clusters for A and their corresponding
traversal probabilities, based on the new random walk graph. Then, the
parent-child pointer from $P_A$ to A is modified such that, instead of maintaining all
clusters that A belongs to, it only maintains the next cluster that will be used
when A is accessed. Based on the pointers and traversal probabilities stored in
the corresponding entry of the cluster table, the value of the next cluster will be
updated after each access.

5.3 Eliminating the Chain Crossing Problem 

To address the chain crossing problem, we present two solutions. For the first
solution, we borrow the idea of the replication approach presented in Section 4.2:
each node is replicated according to its contribution to its clusters, so that the
node has one replica for each cluster it belongs to. When the node needs to be
accessed, one of the replicas is chosen based on how the node’s accesses are
distributed among its clusters (i.e., the clustering probabilities). We call this
solution merge-replication. Since copies of a node are independently accessed,
crossings among the chains that are caused by sharing of the node are removed.
Since the maximum size of the clusters is small, the amount of total replication
is also small. When a child is swapped and the references to it need to be
updated, the client needs to update the physical address only in the corresponding
entry in the cluster table. However, this solution is restricted to read-only
applications, because updates are inherently costly under replication.

If we do not use replication, the crossings will exist and be visible to the 
server. Therefore, in the second solution, we embed the existing chain crossings 
into dummy chain crossings that are uniformly distributed among the existing 
chains. Since, as described in the previous subsection, the number of crossings per
cluster can be limited through random-walk based readjustment of the traversal
probabilities, the number of dummy crossings per cluster is small. Details of
this process are omitted.

6 Conclusion 

In this paper, we build on the protocol we presented in [1, 2] to develop a
protocol that enables secure outsourcing of tree structured data and hides
correlations in tree traversals from the untrusted server in the computational
privacy sense. The earlier protocol [2] ensured privacy when accesses were
uniformly distributed. To ensure computational privacy in the face of non-uniformly
distributed node accesses, which actually occur in real scenarios, we presented a
systematic way to enhance this protocol so that, from the server’s view, node
accesses are uniformly distributed. Since a lot of data, such as XML, has tree-like
structures and queries can be expressed as traversal paths on these trees, this
protocol can be utilized for secure outsourcing of XML documents. Compared with
existing private information retrieval techniques [13,22], our protocol does not
need replication of databases and requires less communication, which makes it
more practical.
Abstract. When outsourcing data to an untrusted database server, the
data should be encrypted. When using thin clients or low-bandwidth 
networks it is best to perform most of the work at the server. In this 
paper we present a method, inspired by secure multi-party computation, 
to search efficiently in encrypted data. XML elements are translated to 
polynomials. A polynomial is split into two parts: a random polynomial 
for the client and the difference between the original polynomial and 
the client polynomial for the server. Since the client polynomials are 
generated by a random sequence generator only the seed has to be stored 
on the client. In a combined effort of both the server and the client a 
query can be evaluated without traversing the whole tree and without 
the server learning anything about the data or the query. 



1 Introduction 

Nowadays the need grows to securely outsource data to an untrusted system. 
Think, for instance, of a remote database server administered by somebody else. 
If you want your data to be secret, you have to encrypt it. The problem then 
arises how to query the database. The most obvious solution is to download the 
whole database locally and then perform the query. This of course is terribly 
inefficient. 

We propose a method that resembles secure multi-party computation, in which
two parties, a client and the database server, together evaluate a query. Before
we present our solution (section 4), we say a few things about secure
multi-party computation in general (section 3).



2 Related Work 

Most modern database management systems (DBMS) include functionality to 
encrypt records. However, they lack native support to query these records. Berti- 
noro [1] have studied how to protect XML data by using a diversified key ap- 
proach. 

In [2] techniques are presented to support keyword-based search on an en- 
crypted textual string. We adapted this work to exploit the tree structure in 
XML documents in [3]. 
Other techniques to support keyword-based search on encrypted textual
strings are presented in [4]. All these keyword-based search techniques can only
be used to find exact matches. [5] provides an order-preserving scheme for
numeric data that allows any comparison operation to be applied directly on the
encrypted data. In [6, 7] techniques are explored which execute SQL-based queries
over encrypted relational tables in a database-service provider model, where an
algebraic framework is described for query rewriting over encrypted attribute
representations.

In [8] a single-server solution for remote querying of encrypted relational 
databases on untrusted servers is presented. The approach is based on the use of 
B+ tree indexing information attached to the relations. The designed indexing 
mechanism can balance the trade-off between efficiency requirements in query 
execution and protection requirements due to possible inference attacks exploit- 
ing indexing information. 

Traditionally, databases are protected against a malicious intruder by means 
of an access control mechanism. However, the database management system 
itself is trusted. When the data is outsourced the database system cannot be 
trusted any more to keep the query and the answer secret. Private Information 
Retrieval [9] aims at letting a user query the database without leaking to the 
database which data was queried. The idea behind PIR is to replicate the data 
among several non-communicating servers. A client can hide his query by asking 
all servers for a part of the data in such a way that no server will learn the 
whole query by itself. [9] proves that PIR with a single server can only be done 
by sending all data to the client for each query. In practice database replication 
is not preferable. Computational PIR [9-11] aims at achieving the same goal as 
information theoretic PIR but uses cryptographic techniques. [12] uses a single 
server scheme which is a compromise between total privacy and efficiency. A 
query is hidden by asking for more nodes than required. The server cannot tell 
which nodes are really needed and which ones are just dummy nodes. To avoid 
replay attacks and server learning, all nodes in the retrieved set are shuffled and 
stored at different locations after each query. 

3 Secure Multi-party Computation 

We speak of secure multi-party computation when several parties calculate a
function result without giving the other parties access to their input. More
precisely, the parties want to evaluate the function result
$(y_1, \ldots, y_n) = f(x_1, \ldots, x_n)$ where each parameter $x_i$ is the private
input of party $P_i$ and $y_i$ its private output. It is also possible that all
$y_i$ are equal. In that case it is written as $y = f(x_1, \ldots, x_n)$. In principle
there exist schemes that can evaluate any function securely using secure
multi-party computation [13]. However, no efficient multi-purpose schemes are
known to us at the moment.

For example, let $f$ be an anonymous voting function. Each voter $P_i$ can
vote for a decision ($x_i = 1$) or against it ($x_i = 0$). The function $f$ can be
defined as $f(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i$ (in case of a majority vote) or
as $f(x_1, \ldots, x_n) = \prod_{i=1}^{n} x_i$ (in case of a veto system).






Richard Brinkman, Jeroen Doumen, and Willem Jonker 



One characteristic of secure multi-party computation is the lack of a trusted 
third party. In our example there is no need for a trusted party to count the 
votes. 

Many secure multi-party computation protocols are based on Shamir’s secret
sharing scheme [14]. These protocols have at least two phases. In the first phase
each party $P_i$ splits up its input $x_i$ in such a way that at least $t < n$ shares
are needed to reconstruct $x_i$. In the second phase each party $P_i$ calculates its
share of the function result given only its own input and the shares of the other
parties. Now, the complete function result is shared over all parties.

We will now give the implementation of one specific secure multi-party
computation protocol. In this protocol $P_i$ shares its input variable $x_i$ by
choosing a random polynomial $g_i$ of degree $t - 1$ such that $g_i(0) = x_i$. $P_i$
sends to each other party $P_j$ the value of $g_i(j)$. When $t$ parties collaborate
they can reconstruct the original polynomial $g_i$ by interpolating the $t$ points
$(j, g_i(j))$. With the polynomial it is easy to recalculate $x_i = g_i(0)$.

The second phase consists of the local computations with the distributed
shares $g_i(j)$ and depends on the function $f$. For simplicity we consider
only our voting case where $f(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i$. Each party $P_j$
locally calculates the sum $h(j) = \sum_{i=1}^{n} g_i(j)$. Having at least $t$
collaborating parties and thus $t$ points $(j, h(j))$ it is possible to reconstruct
the polynomial $h = \sum_{i=1}^{n} g_i$ and also $f(x_1, \ldots, x_n) = h(0)$.
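
The two phases of the voting example can be sketched in a few lines. This is an illustration under assumptions rather than a reference implementation: the prime modulus `P`, the helper names `share` and `reconstruct`, and the use of Python's non-cryptographic `random` module are all choices made here, and each sharing polynomial has degree t − 1 so that exactly t points determine it.

```python
import random

P = 2_147_483_647  # prime modulus 2**31 - 1; an assumption, the text fixes no field

def share(secret, t, n, rng):
    """Phase 1: evaluate a random degree t-1 polynomial g with g(0) = secret at 1..n."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(t - 1)]
    return [sum(c * pow(j, k, P) for k, c in enumerate(coeffs)) % P
            for j in range(1, n + 1)]

def reconstruct(points):
    """Lagrange interpolation at 0 from a list of (j, value) pairs."""
    total = 0
    for j, y in points:
        num, den = 1, 1
        for m, _ in points:
            if m != j:
                num = num * (-m) % P
                den = den * (j - m) % P
        total = (total + y * num * pow(den, -1, P)) % P
    return total

votes = [1, 0, 1, 1, 0]            # private inputs x_i of the five voters
n, t = len(votes), 3
rng = random.Random(42)            # NOT cryptographically secure; illustration only
all_shares = [share(v, t, n, rng) for v in votes]   # all_shares[i][j-1] = g_i(j)

# Phase 2: party j adds up the shares it received, giving h(j).
h = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]

# Any t parties can now interpolate h and read off the tally h(0).
tally = reconstruct([(j + 1, h[j]) for j in range(t)])
```

No party ever sees another voter's input; only the sum of the votes is recovered.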

4 Searching in Encrypted Data 

One way to look at the problem of searching in encrypted data [3, 15, 16] is to
consider the search algorithm as a search function that is to be evaluated in
the sense of secure multi-party computation. The function takes two arguments,
data and query, as input. data is the private input of the client but is stored
on the server, and query is the private input of the client. We achieve this by
splitting the original data into a random part $data_{client}$ and a server part
$data_{server}$ such that $data = data_{client} + data_{server}$. Since $data_{client}$
is generated by a pseudo-random generator, it can be forgotten provided that the
client keeps the random seed. Damiani et al. [17] use the same strategy in the
relational setting. Thus the search function becomes $search(data_{server}, query)$.
Both the client and the server contribute to the evaluation of this function.
The representation and the splitting of the data is not a trivial problem. One
way to represent the data is explained in the following section. In section 4.2
we will solve the problem of sharing and in section 4.3 the querying of the data.

4.1 Data Representation 

Secure multi-party computation works best with simple algebraic expressions 
like polynomials. It is possible to map the tree of elements from an XML file to 
a tree of polynomials. We will demonstrate this mapping by way of the example 
shown in figure 1(a). 
[Figure 1: (a) XML example: a customers root element with two client children,
each containing one name element. (b) Mapping from tag name to numbers:
customers → 3, client → 2, name → 4. (c) Data representation in non-compressed
form: root $(x - 3)((x - 2)(x - 4))^2$, each client node $(x - 2)(x - 4)$, each
name node $(x - 4)$.]

Fig. 1. XML example and its non-reduced representation as a tree of polynomials

First we introduce a mapping function from tag names to integers
($map : tagnames \to \mathbb{Z}$). The mapping function may be chosen arbitrarily.
For our example we choose the mapping function displayed in figure 1(b). The
mapping function should be private to prevent the server from seeing the query
(see section 4.3).

The tree of XML elements is represented as a tree of polynomials. The tree is
built from the leaves up to the root node. The leaf node name is translated into
the polynomial $(x - map(name)) = (x - 4)$. Every non-leaf node is calculated
as the product of the polynomials of all its children times its own factor
$(x - map(node))$. For instance, in figure 1 customers is represented as
$(x - map(customers))((x - 2)(x - 4))^2$, where $(x - 2)(x - 4)$ represents each
client node. Figure 1(c) shows all represented elements.
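
As an illustration of this mapping, the following sketch builds the non-reduced polynomial of every element bottom-up. The coefficient-list representation (lowest degree first) and the helper names `poly_mul`, `node_poly` and `evaluate` are assumptions made for this example; only the tag mapping of figure 1(b) is taken from the text.

```python
import xml.etree.ElementTree as ET

TAG_MAP = {"customers": 3, "client": 2, "name": 4}  # mapping of figure 1(b)

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def node_poly(elem):
    """(x - map(tag)) times the product of the polynomials of all children."""
    poly = [-TAG_MAP[elem.tag], 1]
    for child in elem:
        poly = poly_mul(poly, node_poly(child))
    return poly

def evaluate(poly, x):
    return sum(c * x**k for k, c in enumerate(poly))

doc = ET.fromstring(
    "<customers><client><name/></client><client><name/></client></customers>"
)
root_poly = node_poly(doc)  # (x - 3)((x - 2)(x - 4))^2, degree 5
```

The root polynomial vanishes exactly at the mapped values 2, 3 and 4, which is the property the query mechanism relies on.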

To avoid large degree polynomials we will work in a finite ring. We have
investigated two different rings: $\mathbb{F}_q[x]/(x^{q-1} - 1)$ (where $q$ is a
prime power $q = p^e$; for the reader’s convenience, all proofs will be given for
$q$ prime) and $\mathbb{Z}[x]/(r(x))$ (where $r(x)$ is an irreducible polynomial).
In the first case the coefficients of the polynomials are reduced modulo $q$. If
$p$ is prime then $\forall a \in \mathbb{F}_p^* : a^{p-1} \equiv 1 \pmod{p}$. Since
these polynomials will only be used for evaluation in points of $\mathbb{F}_p$, it
makes sense to store the polynomials modulo $x^{p-1} - 1$. In effect, this means
we are working in $\mathbb{F}_p[x]/(x^{p-1} - 1)$. In order to avoid zero divisors,
we will avoid mapping a tagname to $p - 1$. Thus we reduce every polynomial to a
polynomial of degree less than $p - 1$ with coefficients in $\mathbb{F}_p$.
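
The reduction into $\mathbb{F}_p[x]/(x^{p-1} - 1)$ amounts to folding exponents modulo $p - 1$ and coefficients modulo $p$. The sketch below, with the hypothetical helpers `poly_mul` and `reduce_fp`, applies this to the running example with $p = 5$ and reproduces the polynomials of figure 2(a).

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def reduce_fp(poly, p):
    """Reduce an integer polynomial to F_p[x]/(x^(p-1) - 1)."""
    out = [0] * (p - 1)
    for k, c in enumerate(poly):
        # x^(p-1) = 1 in the quotient ring, so exponent k folds to k mod (p-1).
        out[k % (p - 1)] = (out[k % (p - 1)] + c) % p
    return out

name = [-4, 1]                                            # x - 4
client = poly_mul([-2, 1], name)                          # (x - 2)(x - 4)
customers = poly_mul([-3, 1], poly_mul(client, client))   # (x - 3)((x - 2)(x - 4))^2

reduced_root = reduce_fp(customers, 5)  # coefficients of 3x^3 + 3x^2 + 3x + 3
```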

When working in Z[x]/(r(x)), the polynomial is reduced modulo an irre- 
ducible polynomial r(x). The resulting degree is less than the degree of r(x). 
However, the coefficients are elements of Z and can get quite large for large 
trees. 

Although we calculate in a finite ring, no information about the original tag
names is lost. We will prove this in theorems 1 and 2 for the respective cases
$\mathbb{F}_p[x]/(x^{p-1} - 1)$ and $\mathbb{Z}[x]/(r(x))$. But before we can prove
theorem 1 we need some lemmas.

Lemma 1. If $p$ is prime then $\prod_{i=1}^{p-1} (x - i) \equiv x^{p-1} - 1 \pmod{p}$.

Proof. Let $f(x) = \prod_{i=1}^{p-1} (x - i)$ and $g(x) = x^{p-1} - 1$. Two monic
polynomials of degree $p - 1$ over a field are the same if they share $p - 1$
distinct roots. All elements of $\mathbb{F}_p^* = \{1, \ldots, p - 1\}$ are roots of
$f(x)$. By Fermat’s little theorem, for $p$ prime all these $p - 1$ roots of $f(x)$
are also roots of $g(x)$. Thus the two polynomials are equal.

[Figure 2: (a) in $\mathbb{F}_5[x]$: root $3x^3 + 3x^2 + 3x + 3$, each client node
$x^2 + 4x + 3$, each name node $x + 1$. (b) in $\mathbb{Z}[x]/(x^2 + 1)$: root
$265x + 45$, each client node $-6x + 7$, each name node $x - 4$.]

Fig. 2. The same XML example as in figure 1 but now reduced from $\mathbb{Z}[x]$ to
the finite rings $\mathbb{F}_p[x]/(x^{p-1} - 1)$ and $\mathbb{Z}[x]/(r(x))$

Lemma 2. Let $p$ be prime and $f(x) \in \mathbb{F}_p[x]$. If $f(x)$ is non-zero
modulo $x - (p - 1)$ then $f(x)$ is also non-zero modulo $x^{p-1} - 1$.

Proof. Suppose $f(x) \equiv 0 \pmod{x^{p-1} - 1}$, i.e., $(x^{p-1} - 1) \mid f(x)$.
Since from lemma 1 it follows that $x - (p - 1) \mid x^{p-1} - 1$ in
$\mathbb{F}_p[x]$, we can conclude that $x - (p - 1) \mid f(x)$ and thus also that
$f(x) \equiv 0 \pmod{x - (p - 1)}$. This proves that
$f(x) \equiv 0 \pmod{x^{p-1} - 1} \Rightarrow f(x) \equiv 0 \pmod{x - (p - 1)}$,
which is equivalent to the statement of the lemma.

Lemma 3. Let $p$ be prime, and let $f(x) \in \mathbb{F}_p[x]$ be defined as
$f(x) = \prod_{i=1}^{p-2} (x - i)^{e_i}$. Then $f(x) \not\equiv 0 \pmod{x^{p-1} - 1}$.

Proof. Consider the evaluation of $f(x)$ at $p - 1$:

$$f(p - 1) = \prod_{i=1}^{p-2} (p - 1 - i)^{e_i}$$

Because $\forall i \in \{1, \ldots, p - 2\} : i \neq p - 1$, $f(p - 1) \neq 0$. Thus
$x - (p - 1)$ cannot be a factor of $f(x)$, and we have that
$f(x) \not\equiv 0 \pmod{x - (p - 1)}$. By lemma 2 this implies that
$f(x) \not\equiv 0 \pmod{x^{p-1} - 1}$.

Now we are ready to prove that the mapped values can be retrieved uniquely: 

Theorem 1. Given a polynomial $f(x)$ in $\mathbb{F}_p[x]/(x^{p-1} - 1)$ ($p$ prime)
of an element node and all polynomials $(q_1, \ldots, q_n)$ of its children, the
mapped value $map(node)$ can be retrieved uniquely.

Proof. Because of the way the polynomial $f(x)$ of the element node was
constructed, we know at least one solution exists for the equation

$$f(x) = q_1(x) \cdots q_n(x)(x - t),$$

where $t$ is the mapped value to be retrieved. To prove that the solution is
unique, suppose there are two solutions $t_1$ and $t_2$ to this equation:
$f(x) = q_1(x) \cdots q_n(x)(x - t_1)$ and $f(x) = q_1(x) \cdots q_n(x)(x - t_2)$.
Then $q_1(x) \cdots q_n(x)(x - t_1) = q_1(x) \cdots q_n(x)(x - t_2)$. This can be
rewritten to

$$q_1(x) \cdots q_n(x)(t_1 - t_2) = 0 \pmod{p}.$$

Thus either $q_1(x) \cdots q_n(x) = 0 \pmod{p}$ or $(t_1 - t_2) = 0 \pmod{p}$. Since
we know that $q_1(x) \cdots q_n(x) \neq 0 \pmod{p}$ by lemma 3 (the $q_i$’s match
the required form by construction), we can conclude that $t_1 = t_2 \pmod{p}$.

Using Secret Sharing for Searching in Encrypted Data

[Figure 3: (a) Client part: root $2x^3 + 3x^2 + 2x + 2$; client nodes
$3x^3 + x^2 + 3x + 4$ and $4x^2 + 3x + 3$; name nodes $2x + 2$ and
$4x^3 + 2x^2 + 2x$. (b) Server part: root $x^3 + x + 1$; client nodes
$2x^3 + x + 4$ and $2x^2 + x$; name nodes $4x + 4$ and $x^3 + 3x^2 + 4x + 1$.]

Fig. 3. The shared data over client and server. The sum of a polynomial at the client
side with the corresponding polynomial at the server side equals the original polynomial
of figure 2(a). All polynomials are elements of $\mathbb{F}_5[x]/(x^4 - 1)$

Theorem 2. Given a polynomial $f(x)$ in $\mathbb{Z}[x]/(r(x))$ of an element node
and all polynomials $(q_1, \ldots, q_n)$ of its children, the mapped value
$map(node)$ can be retrieved uniquely.

Proof. As in theorem 1, by construction there exists at least one $t$ that
satisfies $f(x) = q_1(x) \cdots q_n(x)(x - t) \pmod{r(x)}$. To prove that the
solution is unique, suppose there are two solutions $t_1$ and $t_2$. Then
$q_1(x) \cdots q_n(x)(t_1 - t_2) = 0 \pmod{r(x)}$. Since $r(x)$ is irreducible, and
none of the $q_i(x)$ are zero modulo $r(x)$ (by construction), we have that
$t_1 - t_2 = 0 \pmod{r(x)}$. Therefore $t_1 = t_2$.

Note that in both cases the actual solution for t can easily be found. 
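
For the $\mathbb{F}_p$ case, one possible way to find $t$ is sketched below; it is an assumption of this example, not a procedure given in the text. Because no tag is mapped to $p - 1$, no child polynomial vanishes at $x_0 = p - 1$, so evaluating the defining equation there and dividing yields $t$. The sketch uses $p = 7$, since with $p = 5$ the example mapping of figure 1(b) sends name to $p - 1 = 4$; the helper names `evaluate` and `recover_tag_value` are hypothetical.

```python
def evaluate(poly, x, p):
    """Evaluate a coefficient list (lowest degree first) at x modulo p."""
    return sum(c * pow(x, k, p) for k, c in enumerate(poly)) % p

def recover_tag_value(f, children, p):
    """Solve f(x) = q_1(x) ... q_n(x) (x - t) for t by evaluating at x0 = p - 1."""
    x0 = p - 1  # assumed never to be a mapped value, so every q_i(x0) is non-zero
    q = 1
    for child in children:
        q = q * evaluate(child, x0, p) % p
    # f(x0) = q * (x0 - t)  =>  t = x0 - f(x0) / q
    return (x0 - evaluate(f, x0, p) * pow(q, -1, p)) % p

P = 7
name = [-4, 1]                            # x - 4
client = [8, -6, 1]                       # (x - 2)(x - 4)
customers = [-192, 352, -252, 88, -15, 1]  # (x - 3)((x - 2)(x - 4))^2

t_client = recover_tag_value(client, [name], P)
```

Evaluating at a non-zero point gives the same value before and after reduction modulo $x^{p-1} - 1$, so the unreduced coefficient lists can be used directly here.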

4.2 Data Sharing 

Before the data can be stored on the server, it should be split into two parts: one 
for the server and one for the client. The client builds a tree structure similar to 
the tree structure of the original data. But instead of just copying the elements 
it chooses random polynomials. Also it builds the tree to be stored on the server. 
The sum of the corresponding polynomials should be equal to the polynomial 
of the original tree. Look, for example, at the top nodes of figure 4. The sum
$(9x - 12) + (256x + 57)$ equals the root node of figure 2(b) ($265x + 45$).

If the client does not have the storage capacity to store the whole tree, it could 
store only the random seed with which the random polynomials were generated 
and recompute the needed entries of the tree for each query. 

Note that this is a direct application of a basic secret sharing scheme (as is 
often used in secure multi-party computations) . This can easily be extended to a 
model with multiple servers, in which the client together with k out of n servers 
(or any other access structure) can reconstruct the shared secret polynomial. 
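
A minimal sketch of this splitting in $\mathbb{F}_p[x]/(x^{p-1} - 1)$ is given below. The per-node seeding scheme and the names `client_share` and `server_share` are assumptions for illustration; in particular, Python's `random` module is not a cryptographically secure generator and stands in for whatever pseudo-random generator a real deployment would use.

```python
import random

def client_share(seed, node_id, p):
    """Regenerate the client's random polynomial for one node from the seed alone."""
    rng = random.Random(f"{seed}:{node_id}")  # illustration only, not secure
    return [rng.randrange(p) for _ in range(p - 1)]

def server_share(original, seed, node_id, p):
    """Server polynomial = original polynomial minus client polynomial (mod p)."""
    client = client_share(seed, node_id, p)
    return [(o - c) % p for o, c in zip(original, client)]

p = 5
original = [3, 3, 3, 3]  # root polynomial of figure 2(a), lowest degree first
srv = server_share(original, "my-seed", "root", p)
cli = client_share("my-seed", "root", p)  # recomputed on demand, never stored
```

Since the client share depends only on the seed and a node identifier, the client can discard its tree and recompute individual entries when a query needs them.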
[Figure 4: (a) Client part: root $9x - 12$; client nodes $3x + 3$ and $-2x + 8$;
name nodes $-8x + 2$ and $12x - 1$. (b) Server part: root $256x + 57$; client
nodes $-9x + 4$ and $-4x - 1$; name nodes $9x - 6$ and $-11x - 3$.]

Fig. 4. Another sharing with the same principles as in figure 3 but now with
polynomials in $\mathbb{Z}[x]/(x^2 + 1)$



4.3 Querying 

Now that the data has been shared on both the client and the server, we will 
describe how to query the data. First we will discuss simple element lookups: 
find an element given its tag name. In section 4.3 we will look at more difficult 
XPath queries. 



Element Lookup. We assume that the document of figure 1 has been shared 
as described in section 4.2. Let’s further assume that we would like to evaluate 
the query //client. This XPath expression means that we want to find ‘client’ 
elements somewhere in the tree. Normally (even in the non-encrypted case) this 
boils down to traversing the whole tree and comparing the tag names with the 
name ‘client’. We will do it smarter than that. 

First we use the mapping function to translate the tag name ‘client’ to x = 2 
(see figure 1(b)). The client sends this value of x to the server. If we want to 
keep the query secret from the server, the mapping function should be private to 
the client. 

The server evaluates the polynomials at the given point (x = 2). Each time 
a polynomial has been evaluated, the calculated value is sent back to the client. 

The client does the same thing on its own side. Furthermore, it calculates the 
sum of the client element and the server element. If this sum equals zero, then 
the element contains a factor (x - 2), meaning either that the element has tag 
name ‘client’ or that it contains a descendant named ‘client’. A sum different 
from zero means that the branch is dead. If this is the case, the client informs 
the server so that the server can stop evaluating polynomials for elements in the 
tree starting with that branch. 

Each zero element in the sum tree that does not have a zero sub-element 
represents an answer to the query. All other zeros in the sum tree may or may 
not represent correct answers. To find out whether the element itself or one of its 
descendants is named ‘client’, the non-shared polynomials of both the element 
and all its direct children have to be reconstructed. 
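
The lookup protocol can be sketched as follows. This is a minimal sketch, not the authors' implementation: the modulus `P = 5`, the `Node` class, the tag values and the tree are hypothetical, client and server live in one process, and the degree reduction modulo $x^{p-1} - 1$ is omitted because only evaluations at a point are needed here.

```python
import random

P = 5  # small prime, chosen only for this illustration

def poly_mul(a, b):
    """Multiply two polynomials (constant term first), coefficients mod P."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return out

def poly_eval(coeffs, x):
    """Evaluate a polynomial at x modulo P (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result

class Node:
    def __init__(self, name, tag_value, children=()):
        self.name = name
        self.children = list(children)
        poly = [(-tag_value) % P, 1]           # the factor (x - t)
        for child in self.children:
            poly = poly_mul(poly, child.poly)  # times the children's polynomials
        self.poly = poly

def share(node, rng):
    """Split every node polynomial into a random client and a server part."""
    node.client = [rng.randrange(P) for _ in node.poly]
    node.server = [(c - r) % P for c, r in zip(node.poly, node.client)]
    for child in node.children:
        share(child, rng)

def lookup(node, x):
    """Names of zero nodes that have no zero child; dead branches are pruned."""
    total = (poly_eval(node.client, x) + poly_eval(node.server, x)) % P
    if total != 0:
        return []  # dead branch: nothing below needs to be evaluated
    hits = [name for child in node.children for name in lookup(child, x)]
    return hits if hits else [node.name]

# Hypothetical tree: root(t=1) -> a(t=2), m(t=3) -> b(t=2)
tree = Node("root", 1, [Node("a", 2), Node("m", 3, [Node("b", 2)])])
share(tree, random.Random(0))
```

With this tree, a query for x = 2 reaches both leaves mapped to 2, while a query for x = 4 is abandoned at the root. The function returns only the certain answers (zero nodes without a zero child); the uncertain zeros higher up still need the reconstruction step described above.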

To reconstruct the element value, let $f$ be the sum of the polynomials on the 
server and the client of an element and $q_1, \ldots, q_n$ the combined polynomials of 
all its direct children. 

Fig. 5. Query result for the query ‘x = 2’: (a) client part, (b) server part, (c) sum. 
Both the server and the client evaluate the polynomials for the given value of x 
modulo p. The server sends its values to the client, which adds them to its own 
calculated values. A branch is a dead end if the sum is not 0. 

Fig. 6. Query result for the query ‘x = 2’ for the case $\mathbb{Z}[x]/(x^2 + 1)$: (a) client 
part, (b) server part, (c) sum. Everything is calculated modulo $r(2) = 2^2 + 1 = 5$. 



By construction we know that $f$ can be written as 

$$f = (x - t) \prod_{i=1}^{n} q_i \pmod{r} \qquad (1)$$

To check the correctness of an answer we have to solve for $t$ in $f(x) = 0$. In our 
example $t$ should be 2. 

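Theorem 2 proves that there is just a single solution for $t$. It is solved by: 

$$\left.\begin{array}{l} d = \deg(r) \\ f = 0 \pmod{r} \end{array}\right\} \;\Rightarrow\; a_{d-1}x^{d-1} + a_{d-2}x^{d-2} + \cdots + a_1 x + a_0 = 0 \qquad (2)$$

where each $a_i$ is a function of $t$. Note that the same scheme can be used for 
the field $\mathbb{F}_p[x]/(x^{p-1} - 1)$. Each coefficient in (2) must vanish: 

$$\begin{array}{c} a_{d-1}(t) = 0 \\ \vdots \\ a_0(t) = 0 \end{array} \qquad (3)$$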
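
As a small worked instance, take $r(x) = x^2 + 1$ as in figure 4, so that $d = 2$. The symbols $c_0, c_1$ (coefficients of $f$), $b_0, b_1$ (coefficients of the product of the children's polynomials) and the numbers below are illustrative assumptions, not values from the figures.

```latex
\begin{align*}
f &= c_1 x + c_0, \qquad \prod_{i=1}^{n} q_i = b_1 x + b_0 \\
(x - t)(b_1 x + b_0) &= b_1 x^2 + (b_0 - t b_1)x - t b_0 \\
  &\equiv (b_0 - t b_1)x - (t b_0 + b_1) \pmod{x^2 + 1} \\
a_1(t) &= c_1 - b_0 + t b_1 = 0 \\
a_0(t) &= c_0 + b_1 + t b_0 = 0
\end{align*}
% Example: b_1 = 3, b_0 = 1, c_1 = -5, c_0 = -5
%   a_1(t) = -5 - 1 + 3t = 0  gives t = 2
%   a_0(t) = -5 + 3 + t  = 0  gives t = 2 (consistency check)
```

Either equation yields $t$ on its own when the coefficient of $t$ is non-zero; the second one then serves as a check.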



A single (non-trivial) equation in (3) is enough to solve for $t$. The other equations 
may be used to verify the result. Remember that we did not trust the server; 
we now have at least a way to check the answer. If, however, we trust the 
server to give correct answers, the last equation alone is enough. In that case 
only the constant term (the one without x) of each polynomial stored on the server has 
to be transmitted. This reduces bandwidth and increases efficiency but decreases 
security. 

Advanced Querying. So far we evaluated only queries like //tagname, but 
more elaborate XPath queries can also be performed. It is of course possible to 
evaluate a query like //a/b//c/d/e from left to right. That is, search the tree 
for occurrences of ‘a’, then search within the found branches for ‘b’, etc. But it 
is more efficient to evaluate the whole query at once. Since every polynomial in 
the tree contains the roots of all its descendants, a single query can find all 
elements that contain the elements a, b, c, d and e (in any order). In this case 
a search consists of the following steps: 

1. from the root node find all ‘a’ elements that have b, c, d and e elements 
somewhere deeper in the tree 

2. from the found nodes find all direct children ‘b’ that have elements c, d and 
e as descendants 

3. ... 

Using this strategy, elements are filtered out at a very early stage, which 
increases efficiency. 
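
The filtering idea can be illustrated with a short Python sketch. It works on an unshared polynomial for brevity; the modulus `P = 7`, the function names and the tag values are assumptions made for the example.

```python
P = 7  # small prime, for illustration only

def poly_from_roots(roots):
    """Product of (x - t) over all t in roots, coefficients mod P, constant first."""
    coeffs = [1]
    for t in roots:
        shifted = [0] + coeffs                         # x * poly
        scaled = [(-t * c) % P for c in coeffs] + [0]  # -t * poly
        coeffs = [(a + b) % P for a, b in zip(shifted, scaled)]
    return coeffs

def poly_eval(coeffs, x):
    """Evaluate a polynomial at x modulo P (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result

def may_contain_all(coeffs, tag_values):
    """A subtree survives only if its polynomial vanishes at every queried tag value."""
    return all(poly_eval(coeffs, v) == 0 for v in tag_values)

# Subtree whose element and descendants map to the tag values 1, 2 and 3.
subtree = poly_from_roots([1, 2, 3])
```

A subtree that fails the test for any one of the queried tags can be discarded immediately, without descending into it.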

5 Conclusion and Future Work 

We have seen a method to store a tree of XML elements as a tree of polynomials 
and two reduction schemes, one in $\mathbb{Z}[x]/(r(x))$ and one in $\mathbb{F}_p[x]/(x^{p-1} - 1)$. 
These trees are split into a server part and a client part. Both parts are needed to 
retrieve the original data. The created trees can be used to query the data in 
a secure way. Our scheme has only a small penalty in storage space compared 
to the unencrypted case. To store an XML tree with $n$ elements and $p$ different 
tag names in an unencrypted way we need a storage space in the order of $n \log p$. 
In the encrypted case the orders for the cases $\mathbb{Z}[x]/(r(x))$ and $\mathbb{F}_p[x]/(x^{p-1} - 1)$ 
are $n(d + 1) \log p^n = n^2(d + 1) \log p$ and $n(p - 1) \log p$ respectively, where $d$ is the 
degree of $r(x)$. 

The extra amount of storage space is used as a smart index which enables an 
efficient search strategy. Each element has some knowledge of its descendants. 
When searching the tree for an element, a branch can be marked as a dead-end 
in a very early stage. Thus, only a small portion of the tree has to be examined. 

In this paper we only looked at storing and retrieving trees of tag names. 
We did not take into account the actual data between the tags. We cannot 
straightforwardly use the same method for the actual data because, in order 
to keep the mapping function invertible, p and therefore the storage capacity 
become unreasonably large. We can use a hash function to map the data to an 
element of $\mathbb{Z}_p$ but in that case the mapping function is no longer invertible. In 
this case the data polynomials can be used as an index to the encrypted data. 
An alternative would be to choose a totally different approach, such as that of Song et 
al. [2] or Feng and Jonker [16], or to use Bloom filters [18]. The storage and retrieval 
of the actual data is still subject to ongoing research. 


Abstract. A new simple and efficient database encryption scheme is presented. 
The new scheme enables encrypting the entire content of the database without 
changing its structure. In addition, the scheme suggests how to convert the 
conventional database index to a secure index on the encrypted database so that the 
time complexity of all queries is maintained. No one with access to the 
encrypted database can learn anything about its content without having the 
encryption key. 



1 Introduction 

A database is an integral part of almost every information system. According to [1] the 
key features that databases offer are shared access, minimal redundancy, data 
consistency, data integrity and controlled access. 

Databases often hold critical and sensitive information; therefore, an adequate 
level of protection for database content has to be provided. Database 
security methods can be divided into four layers [2]: physical security [3], operating 
system security [4, 5, 6], DBMS security [7, 8, 9] and data encryption [10, 11, 12]. 
The first three layers alone are not sufficient to guarantee the security of the database 
since the database data is kept in a readable form [13]. Anyone having access to the 
database, including the DBA (Database Administrator), is capable of reading the data. 
In addition, the data is backed up frequently so access to the backed up data also needs 
to be controlled [14]. Moreover, a distributed database system makes it harder to con- 
trol the disclosure of the data. 

Database encryption introduces an additional security layer to the first three layers 
mentioned above. It conceals the readable form of sensitive information even if the 
database is compromised. Thus, anyone who manages to bypass the conventional 
database security layers (e.g., an intruder) or a DBA, is unable to read the sensitive 
information without the encryption key. Furthermore, encryption can be used to main- 
tain data integrity so that any unauthorized changes of the data can easily be detected. 

Database encryption can be implemented at different levels [14]: tables, columns, 
rows and cells. Encrypting the whole table, column or row entails the decryption of 
the whole table, column or row respectively when a query is executed. Therefore, an 
implementation which decrypts only the data of interest is preferred. 

The database encryption scheme presented in [13] is based on the Chinese 
Remainder Theorem, where each row is encrypted using different sub-keys for different 
cells. This scheme enables encryption at the level of rows and decryption at the level 
of cells. The database encryption scheme presented in [14] extends the encryption 
scheme presented in [13] by supporting multilayer access control. It classifies subjects 
and objects into distinct security classes. The security classes are ordered in a hierar- 
chy such that an object with a particular security class can be accessed only by sub- 
jects in the same or a higher security class. In this scheme, each row is encrypted with 
sub-keys according to the security class of its cells. One disadvantage of both schemes 
is that the basic element in the database is a row and not a cell, thus the structure of the 
database needs to be changed. In addition, both schemes require re-encrypting the 
whole row when a cell value is modified. 

The conventional way to provide an efficient execution of database queries is by 
using indexes, but indexes in an encrypted database raise the question of how to con- 
struct the index so that no information about the database content is revealed [15, 16]. 

The indexing scheme provided in [17] is based on encrypting the whole row and as- 
signing a set identifier to each value in this row. When searching a specific value its 
set identifier is calculated and then passed to the server which in turn returns to the 
client a collection of all rows with values assigned to the same set. Finally, the client 
searches the specific value in the returned collection and retrieves the desired rows. 
However, in this scheme, equal values are always assigned to the same set, thus some 
information is revealed when applying statistical attacks. 

The indexing scheme provided in [18] is based on constructing the index on the 
plaintext values and encrypting each page of the index separately. Whenever a specific 
page of the index is needed for processing a query, it is loaded into memory and de- 
crypted. Since the uniform encryption of all pages is likely to provide many cipher 
breaking clues, the indexing scheme provided in [19] suggests encrypting each index 
page using a different key depending on the page number. However, these schemes 
being implemented at the level of the operating system are not satisfactory. 

Assuming the index is implemented as a B+-Tree, encrypting each of its fields 
separately would reveal the ordering relationship between the ciphertext values. The 
indexing scheme provided in [15] suggests encrypting each node of the B+-Tree as a 
whole. However, since references between the B+-Tree nodes are encrypted together 
with the index values, the index structure is concealed. 

In order to overcome the shortcomings of existing database encryption schemes, a 
new simple and efficient scheme for database encryption is proposed which suggests 
how to encrypt the entire content of the database without changing its structure. This 
property allows the DBA to continue managing the database without being able to 
view or manipulate the database content. Moreover, anyone gaining access to the 
database can learn nothing about its content without the encryption key. The new 
scheme suggests how to construct a secure index on the encrypted database so that the 
time complexity of all queries is maintained. Since the database structure remains the 
same no changes are imposed on the queries. 

The remainder of the paper is structured as follows: in section 2 the desired 
properties of a database encryption scheme are outlined; in section 3 the new database 
encryption scheme is illustrated; in section 4 the desired properties of a secure indexing 
scheme are described; in section 5 a new indexing scheme for the encrypted database 
is proposed; in section 6 performance and implementation issues are discussed; and 
section 7 presents our conclusions. 



2 The Desired Properties of a Database Encryption Scheme 

According to [13], a database encryption scheme should meet the following require- 
ments: 

1) The encryption scheme should either be theoretically or computationally secure 
(require a high work factor to break it). 

2) Encryption and decryption should be fast enough so as not to degrade system 
performance. 

3) The encrypted data should not have a significantly greater volume than the unen- 
crypted data. 

4) Decryption of a record should not depend on other records. 

5) Encrypting different columns under different keys should be possible. 

6) The encryption scheme should protect against pattern matching attacks and 
substitution attacks on encrypted values. 

7) Modifying data by an unauthorized user should be noticed at decryption time. 

8) Recovering information from partial records (records where some cells have null 
values) should be the same as from full records. 

9) The security mechanism should be flexible and not entail any change in the 
structure of the database. 

A naive approach for database encryption is to encrypt each cell separately but this 
approach has several drawbacks. First, two equal plaintext values are encrypted to 
equal ciphertext values. 

$$V_i = V_j \;\Rightarrow\; E_k(V_i) = E_k(V_j) \qquad (1)$$

Therefore, it is possible, for example, to collect statistical information as to how many 
different values a specified column currently has, and what their frequencies are. The 
same holds for the ability to execute a join operation between two tables and collect 
information from the results. Second, two ciphertext values can be switched without 
the change being noticed. Different ciphertext values for equal plaintext values can be 
achieved using a polyalphabetic cipher (e.g. Vernam). However, in this solution 
decryption of a record depends on other records and thus requirement 4 is violated. 

In the next section a new database encryption scheme complying with all the above 
requirements is presented. 



3 A New Database Encryption Scheme 

The position of a cell in the database is unique and can be identified using the triplet 
that includes its Table ID, Row ID, and Column ID. We will refer to this triplet as the 
cell coordinates. 

We suggest a new database encryption scheme where each database value is 
encrypted with its unique cell coordinates. These coordinates are used in order to break 
the correlation between ciphertext and plaintext values in an encrypted database. The 
new scheme has two immediate advantages. First, it eliminates substitution attacks 
attempting to switch encrypted values. Second, pattern matching attacks attempting 
to gather statistics based on the encrypted database values would fail. 

Fig. 1. Database encryption using two approaches. 

Figure 1 illustrates database encryption using two approaches. Figure 1a describes a 
database table (T) with one data column (C). Figure 1b describes encryption of table T 
using the naive approach. Figure 1c describes encryption of table T using the new 
approach where each cell is encrypted with its cell coordinates. It is easy to see that 
equal plaintext values in figure 1a are encrypted to different ciphertext values in figure 
1c, as opposed to the ciphertext values in figure 1b. 

3.1 Encryption/Decryption in the New Scheme 

Let us define: 

$V_{trc}$ - A plaintext value located in table t, row r and column c. 

$\mu : (\mathbb{N} \times \mathbb{N} \times \mathbb{N}) \to \mathbb{N}$ - A function that generates a number based on the database 
coordinates. 

$Enc_k$ - A function which encrypts a plaintext value with its coordinates. 

$$Enc_k(V_{trc}) = E_k(V_{trc} \oplus \mu(t, r, c)) \qquad (2)$$

Where k is the encryption key and $E_k$ is a symmetric encryption function (e.g. DES, 
AES). 

$X_{trc}$ - A ciphertext value located in table t, row r and column c. 

$$X_{trc} = Enc_k(V_{trc}) \qquad (3)$$

$Dec_k$ - A function which decrypts a ciphertext value ($X_{trc}$) and discards its coordinates. 

$$Dec_k(X_{trc}) = D_k(X_{trc}) \oplus \mu(t, r, c) = V_{trc} \qquad (4)$$

Where k is the decryption key and $D_k$ is a symmetric decryption function. 
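
Equations (2) to (4) can be sketched in Python as follows. Everything here is illustrative: the toy Feistel network merely stands in for a real block cipher such as AES and is not secure, and the hash-based `mu`, the 16-byte block size, the zero padding and the key are assumptions made for the example.

```python
import hashlib
import hmac

BLOCK = 16  # block size in bytes (illustrative)

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key, i, half):
    # Round function of the toy Feistel network: keyed hash of one half block.
    return hmac.new(key, bytes([i]) + half, hashlib.sha256).digest()[:BLOCK // 2]

def E(key, block):
    """Toy 4-round Feistel cipher standing in for E_k (NOT secure)."""
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def D(key, block):
    """Inverse of E, standing in for D_k."""
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right

def mu(t, r, c):
    """Hypothetical mu: a hash of the cell coordinates, one block long."""
    return hashlib.sha256(f"{t}|{r}|{c}".encode()).digest()[:BLOCK]

def enc(key, value, t, r, c):
    """Equation (2): Enc_k(V_trc) = E_k(V_trc XOR mu(t, r, c))."""
    padded = value.ljust(BLOCK, b"\0")  # assumes len(value) <= BLOCK
    return E(key, _xor(padded, mu(t, r, c)))

def dec(key, cipher, t, r, c):
    """Equation (4): Dec_k(X_trc) = D_k(X_trc) XOR mu(t, r, c)."""
    return _xor(D(key, cipher), mu(t, r, c)).rstrip(b"\0")

key = b"demo key"
x1 = enc(key, b"17", t=1, r=1, c=1)
x2 = enc(key, b"17", t=1, r=2, c=1)  # same plaintext, different row
```

The two ciphertexts differ although the plaintexts are equal, and decrypting `x1` with the coordinates of another cell does not return the original value.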



3.2 Data Integrity 

Encryption ensures that a user not possessing the encryption key cannot modify a 
ciphertext value and predict the change in the plaintext value. Usually the range of 
valid plaintext values is significantly smaller than the whole range of possible plain- 
text values. Thus, the probability that an unauthorized change to a ciphertext value 
would result in a valid plaintext value is negligible. Therefore, unauthorized changes 
to ciphertext values are likely to be noticed at decryption time. 

Substitution attacks, as opposed to pattern matching attacks, cannot be prevented 
simply by using encryption. In the new scheme, each value is encrypted with its 
unique cell coordinates. Therefore, trying to decrypt a value with different cell 
coordinates (e.g. as a result of a substitution attack) would probably result in an invalid 
plaintext value. 

If the range of valid plaintext values is not significantly smaller than the whole 
possible range, or invalid plaintext values cannot be distinguished from valid plaintext 
values, encryption has to be carried out as follows: 

$$Enc_k(V_{trc}) = E_k(V_{trc} \,\|\, \mu(t, r, c)) \qquad (5)$$

Since $\mu(t, r, c)$ is concatenated to the plaintext value before encryption, attempting 
to change the ciphertext value or trying to switch two ciphertext values would 
result in a corrupted $\mu(t, r, c)$¹ after decryption. Obviously, concatenating $\mu(t, r, c)$ 
results in data expansion. 

3.3 Scheme Analysis 

The new database encryption scheme satisfies the requirements mentioned in sec- 
tion 2: 

1) The scheme security relies on the security of the encryption algorithm used. In 
order to reveal some database value it has to be decrypted using the correct key. 

2) Encryption and decryption are fast operations and are mandatory in any database 
encryption scheme. The proposed implementation adds the overhead of an XOR 
operation and the $\mu$ computation, which are negligible compared to encryption. 

3) Using encryption algorithms such as DES or AES which are based on encrypting 
blocks of data results in value expansion (in many cases this expansion is 
negligible). 

4) The basic element of reference is a database cell. Operations on a cell do not 
depend on or have any effect on other cells. 

5) The proposed scheme facilitates subschema implementation. Since each cell is 
encrypted separately, each column can be encrypted under a different key². 

6) The new scheme prevents pattern matching attacks since there is no correlation 
between a plaintext value and a ciphertext value (achieved by using encryption) 
and there is no correlation between ciphertext values (achieved by using $\mu$ 
before encryption). Substitution attacks are also prevented as discussed in section 
3.2. 

7) Unauthorized manipulation of the encrypted data without the encryption key 
would be noticed at decryption time (see section 3.2). 

8) As the basic element of reference is a database cell, it is possible to recover 
information from partially completed records (records with null values) in the 
same way as it is recovered from full records. 

9) The new scheme complies with the structure preserving requirements as the 
basic element of reference is a database cell. 

¹ The implementation of $\mu$ is discussed in section 6.2. 

8) As the basic element of reference is a database cell, it is possible to recover in- 
formation from partially completed records (records with null values) in the 
same way as it is recovered from full records. 

9) The new scheme complies with the structure preserving requirements as the ba- 
sic element of reference is a database cell. 



4 The Desired Properties of a Secure Indexing Scheme 

An index is a data structure supporting efficient access to data and indexes are fre- 
quently used in databases. Most commercial databases even create a default index on 
the primary-key columns. Most databases implement indexes using a B+-Tree which 
is a data structure maintaining an ordered set of values and supporting efficient opera- 
tions on this set such as search, insert, update and delete. 

Figure 2 illustrates a database index which is constructed on column C in table T 
and is implemented as a B+-Tree. A graphical representation of the B+-Tree is given 
in figure 2a; a table representation of the B+-Tree is given in figure 2b and table T is 
given in figure 2c. Figure 2b sharpens the separation between the index structure and 
its data. 

A secure index in an encrypted database has to comply with the following require- 
ments: 

1) No information about the database plaintext values can be learned from the in- 
dex. 

2) The secure index should not reduce the efficiency of data access. 

3) The secure index should not reduce the efficiency of insert, update and delete 
operations. 

4) The secure index should not have a significantly greater volume than an ordinary 
index. 

5) The secure index structure should not differ from a standard index. In this way, a 
DBA can manage the index without the encryption key. 

A trivial approach which constructs an index over the plaintext values would reduce 
security since the plaintext values are exposed. Another approach would construct the 
index over the database ciphertext values. In this approach, executing equality queries 
is possible but executing range queries is a problem. This approach would expose the 
index to pattern matching attacks since equal plaintext values are encrypted to equal 
ciphertext values. Moreover, since executing range queries is a problem, Oracle does 
not support encrypting indexed data [20]. 

² Key management is discussed in section 6.3. 



Fig. 2. An example of a database index: (a) an index constructed on column C in 
table T; (b) table representation of the index; (c) table T. 



In the next section, a new indexing scheme which overcomes the shortcomings of 
existing indexing schemes is presented. 



5 A New Database Indexing Scheme 

Several indexing schemes for encrypted databases have been proposed [15, 18, 17, 21] that 
fulfill most of the requirements described in section 4, but none preserves the index 
structure. We claim that there should be a separation between data and structure. For 
example, a DBA should be able to manage database indexes without the need to 
decrypt their values. 

We suggest a new database indexing scheme which preserves the index structure, 
where each index value is the result of encrypting a plaintext value in the database 
concatenated with its row-id. This ensures that there is no correlation between the 
index values and the database ciphertext values³. Furthermore, the index does not 
reveal the statistics or order of the database values. 



³ If the database is encrypted as described in section 3.2, then $\mu$ should not be implemented 
as $\mu(t, r, c) = r$ since there will be a strong correlation between the index values and the 
database encrypted values. 

5.1 Index Construction in the New Scheme 

In order to construct an index, a set of values and a function determining the order⁴ of 
these values are needed. 

Let us define: 

$C$ - An encrypted database column that was encrypted as defined in section 3.1. 

$C_p$ - The column obtained from decrypting column $C$: 

$$Dec_k(X_{trc}) \in C_p \iff X_{trc} \in C \qquad (6)$$

Where $Dec_k$ is the decryption function defined in section 3.1. 

$C_i$ - The column obtained from encrypting values in $C_p$ concatenated with their 
row-ids: 

$$E_k(V_{trc} \,\|\, r) \in C_i \iff V_{trc} \in C_p \qquad (7)$$

Where k is the encryption key, $E_k$ is an encryption function and r is the row-id. 

$A_k : C_i \to C_p$ - A function which decrypts a value in $C_i$ (using key k) and 
discards its row-id: 

$$A_k(x) = Discard(D_k(x), |r|) \qquad (8)$$

Where k is the decryption key, $D_k$ is a decryption function, r is the row-id, $|r|$ is 
the length of r in bits, and $Discard(v, n)$ stands for discarding the n rightmost 
bits of v. 

$R_p$ - The values in $C_p$ are ordered by the relation $R_p$: 

$$(x, y) \in R_p \iff x, y \in C_p \;\text{And}\; (x < y) \qquad (9)$$

$R_i$ - The values in $C_i$ are ordered by the relation $R_i$: 

$$(x, y) \in R_i \iff x, y \in C_i \;\text{And}\; (A_k(x), A_k(y)) \in R_p \qquad (10)$$

The new index will be constructed based on the values in $C_i$, using the relation $R_i$ as 
an order function. 

Figure 3 illustrates encryption of the table and the index which were illustrated in 
figure 2 using the new schemes. Figure 3a describes the encryption of the table in the 
new scheme where each cell is encrypted with its coordinates. Figure 3b describes the 
encryption of the index where each index value is the result of encrypting a database 
plaintext value concatenated with its row-id. It is easy to see that the table and index 
structure are not changed by the encryption process. 



⁴ Some indexes require only an equality function and not an order function to be constructed. In 
this case, the term "order" in this section can be replaced by the term "equality". 

Fig. 3. Encryption in the new scheme. 



5.2 Executing a Query in the New Scheme 

The following SQL query illustrates the retrieval of all rows in table T whose 
values in column C are greater than or equal to V: 

SELECT * FROM T WHERE T.C >= V (11) 

The following pseudo code illustrates the retrieval of the row-ids of the rows which answer 
the above query. The pseudo code assumes that the index is implemented as a binary 
B+-Tree. 

INPUT: A table T, a column C and a value V. 
OUTPUT: A collection of row-ids. 

X := getIndex(T, C).getRootNode(); 
While X is not a leaf Do 
    If X.getData().getValue() < V Then 
        X := X.getRightSonNode(); 
    Else 
        X := X.getLeftSonNode(); 
    End If; 
End While; 
RESULT := ∅; 
While X is not null And X.getData().getValue() < V Do 
    X := X.getRightSiblingNode(); 
End While; 
While X is not null Do 
    RESULT := RESULT ∪ {X.getData().getRowId()}; 
    X := X.getRightSiblingNode(); 
End While; 
Return RESULT; 

Each node in the index which is not a leaf has a left son node, a right son node and 
a data item which stores a value. Each leaf in the index has a right sibling node and a data 
item which stores a value and a row-id. 

In the new scheme the data in each index node is an encryption of a database value 
concatenated with its row-id. Thus, the functions getValue() and getRowId() need to 
be given a new implementation in order to support the new indexing scheme. 
However, the above pseudo code stands without any change. 
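
A compact Python sketch of the index construction of section 5.1 and of the range query above is given here. It is an illustration under assumptions: a sorted list replaces the B+-Tree, the toy Feistel network stands in for a real block cipher and is not secure, and values and row-ids are taken to be 64-bit unsigned integers so that $V \| r$ fills exactly one 16-byte block. All function names are hypothetical.

```python
import hashlib
import hmac
import struct

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key, i, half):
    return hmac.new(key, bytes([i]) + half, hashlib.sha256).digest()[:8]

def E(key, block):
    """Toy 4-round Feistel cipher on 16-byte blocks (stand-in for E_k, NOT secure)."""
    left, right = block[:8], block[8:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def D(key, block):
    """Inverse of E (stand-in for D_k)."""
    left, right = block[:8], block[8:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right

def index_entry(key, value, row_id):
    """Equation (7): encrypt the value concatenated with its row-id."""
    return E(key, struct.pack(">QQ", value, row_id))

def A(key, entry):
    """Equation (8): decrypt and discard the row-id (the rightmost 64 bits)."""
    return struct.unpack(">QQ", D(key, entry))[0]

def row_id_of(key, entry):
    return struct.unpack(">QQ", D(key, entry))[1]

def build_index(key, column):
    """Order the encrypted entries by the relation R_i, i.e. by A_k."""
    entries = [index_entry(key, v, r) for r, v in column.items()]
    return sorted(entries, key=lambda e: A(key, e))

def range_query(key, index, v):
    """Row-ids of all entries whose plaintext value is >= v (binary search)."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if A(key, index[mid]) < v:
            lo = mid + 1
        else:
            hi = mid
    return {row_id_of(key, e) for e in index[lo:]}

key = b"demo key"
column = {1: 30, 2: 10, 3: 30, 4: 20}  # row-id -> plaintext value
index = build_index(key, column)
```

Only the comparison and the row-id extraction touch the key; the search itself is the ordinary binary search, which mirrors the point made above that the query algorithm stays unchanged.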



5.3 Index Integrity 



In the new scheme, a substitution attack which attempts to substitute index values can 
be carried out without being noticed at decryption time. If it is possible to maintain a 
unique position for each value in the index, this kind of attack can be eliminated using 
a technique similar to the one proposed in section 3 where each value is encrypted 
with its unique position. 

Figure 4 illustrates data integrity maintenance of the table and the index which were 
illustrated in figure 2. Figure 4a describes data integrity maintenance of the table as 
suggested in section 3.2. Figure 4b describes data integrity maintenance of the index 
where each index value is concatenated to its unique position in the index (ID) and 
then encrypted. 

We argue that without changing the index structure and affecting its efficiency, 
maintaining a unique position for each value in the index is not a trivial matter. 

Fig. 4. Maintaining data integrity: (a) of the table; (b) of the index. 



5.4 Scheme Analysis 

The new index implementation on an ordered set of values is identical to the ordinary 
index implementation. The only differences between the ordinary index and the new 
one are the set of values and the order function defined on them. 

The new index complies with the requirements mentioned in section 4: 

1) Since the values in the index are encrypted and unique (achieved by concatenating 
the row-id), there is no correlation between them and the column ciphertext 
values or the column plaintext values. Therefore, no information is revealed on the 
database data by the new index. 

2) The order function is implemented in a time complexity of $O(1)$ since 
decryption and discarding bits are implemented in a time complexity of $O(1)$. 
Therefore, data access using the proposed index is as efficient as with an ordinary 
index. 

3) Determining the order of two values is implemented in a time complexity of 
$O(1)$. Therefore, the delete operation is as efficient as in an ordinary index. 
Encrypting a new value is implemented in a time complexity of $O(1)$, thus the 
efficiency of insert and update operations is not changed. 

4) Each value in the new index is a result of encrypting a database plaintext value 
concatenated with its row-id, therefore the space added for each node in the new 
index is fixed. Thus, the index space complexity remains the same. 

5) The new index structure remains the same and only its data is modified. Thus, 
any administrative work on the index can be carried out without the need of de- 
crypting the index values. 



6 Performance and Implementation Issues 

Implementing the new schemes requires careful consideration. Several performance 
and implementation issues are discussed in this section. 

6.1 Stable Cell Coordinates 

The proposed scheme assumes that cell coordinates are stable. That is, insert, update 
and delete operations do not change the coordinates of existing cells. However, if a 
database reorganization process changes cell coordinates, all affected cells must be 
re-encrypted with their new coordinates and the index updated accordingly. 

A naive implementation which uses the row number in the table as the row-id, 
proves to be limited in this respect as row numbers are affected by insert and delete 
operations. In the Oracle database, for example, cell coordinates are stable. 

6.2 Implementing a Secure fi Function 

As defined in section 3.2, the values in the database are encrypted as follows: 

Enc K (V trc ) = E k (V trc || ju(t, r, c)) (12) 

A secure implementation of // would generate different numbers for different coordi- 
nates: 

0i,7i,q) * (t 2 ,r 2 ,c 2 )< »//(?!, r 1; q) * fl(t 2 ,r 2 ,c 2 ) (13) 

Unfortunately, generating a unique number for each database coordinates may result 
in considerable data expansion. An alternative implementation reducing the data ex- 
pansion may result in collisions. Assume that there are two cells, which fu generates 
two equal values for their coordinates: 

, t'| , C’| , 1 2 , 1' 2 , c 2 | 

■ ■ ■ (14) 

[(t l ,i\,c l )*(t 2 ,r 2 ,c 2 )]A[jU(t 1 ,r 1 ,c 1 ) = J U(t 2 ,r 2 ,c 2 )] 

It is possible to substitute the ciphertext values of these cells ($X_{t_1 r_1 c_1}$ and
$X_{t_2 r_2 c_2}$) for one another without $\mu$ being corrupted at decryption time. If it is
difficult to find two cells such as those mentioned above, this kind of attack can be
prevented. This can be achieved by using a collision-free hash function.
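
The trade-off between a compact $\mu$ and a collision-free one can be sketched as follows. This is a hedged illustration, not the authors' implementation: `mu_weak`, `mu_hash`, the choice of SHA-256, and the 16-byte truncation are all assumptions made for the example.

```python
import hashlib


def mu_weak(t: int, r: int, c: int) -> int:
    """Compact but collision-prone: one byte, and insensitive to coordinate order."""
    return (t + r + c) % 256


def mu_hash(t: int, r: int, c: int) -> bytes:
    """SHA-256 over an unambiguous encoding of the coordinates, truncated to 16 bytes."""
    encoded = f"{t}|{r}|{c}".encode("utf-8")
    return hashlib.sha256(encoded).digest()[:16]


def swap_goes_unnoticed(mu, cell_a, cell_b) -> bool:
    """A swapped ciphertext passes the coordinate check iff mu agrees on both cells."""
    return cell_a != cell_b and mu(*cell_a) == mu(*cell_b)


cell_a = (1, 2, 3)   # (table, row, column), hypothetical coordinates
cell_b = (1, 3, 2)
weak_result = swap_goes_unnoticed(mu_weak, cell_a, cell_b)
hash_result = swap_goes_unnoticed(mu_hash, cell_a, cell_b)
```

The weak function maps both coordinates to the same value, so a substitution between the two cells would not be detected; the hash-based function keeps them apart at the cost of 16 extra bytes per cell.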

6.3 Key Management 

Databases contain information of different degrees of sensitivity that has to be
selectively shared among a large number of users. The proposed scheme facilitates
subschema implementation since each column can be encrypted with a different key.
Encrypting each column with a different key results in a large number of keys for
each legitimate user. However, the approach proposed in [22] can reduce the
number of keys. Reference [22] describes how to find the smallest elements that can be
encrypted using the same key according to the access control policy. Thus,
the keys are generated according to the access control policy in order to keep their
number minimal. This approach can be incorporated in the proposed scheme in order
to encrypt sets of columns with the same key in accordance with the database access
control policy.
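
The idea of sharing one key among columns with the same access rights can be sketched as follows. This is not the algorithm of [22]; it is a simplified grouping under the assumption that the policy is given as a map from column to reader set, and the column and role names are hypothetical.

```python
from collections import defaultdict


def group_columns_by_policy(policy):
    """Map each distinct set of authorized readers to the columns sharing it.

    policy: dict of column name -> iterable of users allowed to read the column.
    Columns with identical reader sets can be encrypted under one key.
    """
    groups = defaultdict(list)
    for column, users in policy.items():
        groups[frozenset(users)].append(column)
    return {readers: sorted(cols) for readers, cols in groups.items()}


# Hypothetical access control policy.
policy = {
    "name":    {"hr", "payroll", "manager"},
    "address": {"hr", "payroll", "manager"},
    "salary":  {"hr", "payroll"},
    "bonus":   {"hr", "payroll"},
    "review":  {"hr", "manager"},
}
key_groups = group_columns_by_policy(policy)
```

Five columns collapse into three key groups here, so each user holds at most three keys instead of up to five.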

6.4 Performance 

In the new scheme, all conventional algorithms remain the same since the structure of 
the database remains the same. This ensures that the only overhead of the new scheme 
is that of encryption and decryption operations. 



7 Conclusions 

In this paper, a new structure-preserving scheme for database encryption has been
presented. In the new scheme, each database cell is encrypted with its unique position,
which guarantees that pattern matching and substitution attacks cannot succeed,
thus guaranteeing information confidentiality and data integrity.

A new database indexing scheme that does not reveal any information on the database
plaintext values was proposed. In the new scheme, index values are encrypted
with a unique number (the row-id of the database value) in order to eliminate pattern
matching attacks and any correlation between index and database values. Ensuring
index integrity is possible if an index position can be attached to each index value by
simply using a technique similar to the one used for table encryption.

The new schemes do not impose any changes on the database structure, thus ena- 
bling a DBA to manage the encrypted database as any other non-encrypted database. 
Furthermore, implementing the new scheme in existing applications does not entail 
modifying the queries.

Abstract. We provide a formal model of security guarantees offered by
digital signature schemes when they are applied to structured data. This 
model is an important step towards managing the integrity of data that is 
shared, integrated, transformed, and exchanged on the World Wide Web. 
We express signature semantics using well-known database constraints, 
which can help authors decide what to sign, help recipients evaluate the 
integrity of signed data, and clarify the capabilities of different signature 
technologies. 



1 Introduction 

Data exchange on the World Wide Web is characterized by many original au- 
thors, many contributing or integrating agents, and many final recipients. For 
example, a report on scientific data exchange [16] finds that a major source of 
scientific discovery today is the dry laboratory , which takes previously published 
experimental data and processes, cleans, integrates, and republishes it. For some 
applications, including scientific data exchange, it is critical for users to be able 
to trace the original source of each data item and to prevent tampering. That 
is, users require data integrity , which means accurately attributing data to its 
author and preventing unauthorized modification of data items. Our goal is to 
provide integrity in large-scale data exchange by using digital signatures. 

We do not propose a novel signature scheme here. Instead, our main contri- 
bution is a formalization of known signature schemes [7, 18, 9, 12, 13] in terms 
of logical constraints over database instances. We provide a formal model of 
integrity for signed data which we use to address a number of difficult open 
problems in adapting signatures to data exchange. For one, the original data 
author uses this formal model in choosing a specific signature scheme, which 
constrains how contributing agents downstream may modify the data while still 
attributing it to the original author. Second, end users can use this formal model 
- along with well-established database theory - to reason about the integrity of 
queries over signed data received from intermediate agents. Finally, the formal 
model provides a uniform treatment of disparate techniques, allowing them to 
be compared precisely and combined appropriately. 

Basic setting. The basic setting for providing integrity in data exchange is
illustrated in Figure 1. Alice is an author who publishes a database R. This data
is received and processed by Bob, who may transform it, integrate it with other
data sources, or provide some other service, and then publishes it in the form of
a database R'. Carol is an end-user who wants to use Bob’s enhanced data R',
yet wants to verify that the content in R' that came from Alice has not been
modified. To ensure the integrity of her data, Alice provides a digital signature
of R to Bob, who uses it to derive a digital signature on the modified data R'.
This, in turn, is verified by Carol.

Fig. 1. Simplified data exchange scenario. $S_{Alice}$ is Alice’s signature on R. $S'_{Alice}$ is a
(possibly different) signature object provided by Bob to Carol (but still in the name of
Alice).

The simplest way to sign R is to apply a conventional digital signature to the 
entire database. Any modification of R will cause verification to fail, but this 
signature strategy prevents Bob from making any meaningful changes to the 
data. Therefore it is often important for Alice to sign the data so as to allow 
extraction, integration, and sometimes controlled modifications of the data by 
Bob. 

A formal model for signed data. In recent work, a number of signature 
schemes [18, 9, 12, 7, 13] have been devised that can be used to manage the bal- 
ance between preventing unauthorized modification and allowing reuse of data 
(we describe their features in Sec. 2). However, there is currently no unified 
formal model for stating their properties, making them hard to use in complex 
data exchange scenarios. We propose here such a formal model, which describes 
the security properties offered by a signature as a set of constraints that hold 
between the original (but unavailable) source R and the received database R' . 

We believe a formal model is critical to providing integrity in data exchange. 
It is the basis upon which Alice chooses the correct signature (to prevent unau- 
thorized modification but allow innocuous modification). It also allows Carol to 
analyze the integrity of the data she receives. For example, Carol may want to 
evaluate a particular query over the data and be assured that the answer to the 
query could not have been modified by Bob. Finally, a formal model clarifies 
the capabilities of known signature schemes and suggests new signature schemes 
required for providing integrity in data exchange. 

Constraints appear to be the right tool because they are capable of expressing 
the semantics of signatures, they are familiar to database practitioners, and 
because there is a rich theory whose results can be applied to our setting (see 
Section 4). 

Application: scientific data exchange. The management of molecular biology
data is an application scenario requiring the management of integrity for
exchanged data. Primary sources contain original experimental data, from which
hundreds of secondary biological sources [2] are derived. The secondary sources
export views over primary sources and/or other secondary sources, and usually
add their own curatorial comments and modifications [14]. These databases are
often published on the Web as structured text files, not stored in proprietary
systems or servers that can provide security guarantees. The data consumers are
scientists, and a significant fraction of research takes place in so-called “dry”
laboratories using data collected and curated by others. An illustration of this
scenario is provided in Figure 2.

Fig. 2. Exchange scenario for scientific data.

The main security concerns in this setting are attributing and retaining au- 
thorship, permitting proper curatorial additions, and avoiding the careless mod- 
ification of data through integrity controls. The risk of malicious tampering with 
the data is usually not a primary security concern in this setting. To the best of 
our knowledge, security properties are rarely provided in scientific data exchange. 
Although in some cases authorship may be traced, there is little verification or 
certification of accurate authorship. 

Digital signatures alone cannot solve the challenges of providing integrity in 
such complex scenarios. The formal model presented below allows us to unify 
disparate signature technologies in terms of constraints that they enforce, and 
forms a basis for managing integrity in data exchange. 

Paper organization. In Section 2 we present some simple integrity challenges, 
and then describe three types of known digital signature schemes that can be 
used to address them. In Section 3, we present a formal model of each signature 
scheme, and in Section 4 we apply the formal model to query answering. We 
summarize related work and conclude in Sections 5 and 6. 



2 Signing Data to Provide Integrity 

We begin this section with some simple examples of properties an author may 
want to enforce over published data. Then we describe informally three known 
classes of signature schemes (conventional, homomorphic, and tree-based signatures).

2.1 Integrity Challenges for Data Exchange 

We illustrate here several properties that Alice may want to enforce when signing
a data source, using a simple example consisting of a database relation Stock
describing attributes ticker, rating, and industry of stock recommendations. A
sample database is illustrated in Fig. 3. While our interest is in richer domains
like scientific data, we use this dataset to simplify the discussion. Further, we
restrict ourselves to relational data even though semi-structured XML data is a
more likely choice in large-scale data exchange.

Fig. 3. A database of stock recommendations Stock(ticker, rating, industry) shown in (a)
along with sets of tuples (b) and (c) derived from (a).

Recall from Fig. 1 that Alice wants to sign the data so as to allow Bob to 
perform only certain transformations. We consider in this paper the following 
challenges: 

1. Alice requires Stock to be complete and correct whenever it is attributed 
to her: every tuple must be present, and no forged tuples may be added by 
Bob. 

2. Alice allows tuples to be removed from Stock, permitting Bob to publish a 
subset of the stock recommendations. However it should not be possible for 
Bob to introduce tuples not present originally. 

3. Alice requires all tuples of Stock to be present, but Bob may add additional 
tuples. 

4. Alice allows subsets of Stock tuples defined by a selection condition on the 
industry attribute, but all tuples must be provided for each such selection. 
Fig. 3(b) satisfies this requirement for condition industry='Technology'. 

5. Alice permits Bob to update the rating attribute of any tuple, but he cannot
modify other columns. All tuples must be present in the collection. 

6. Alice permits Bob to add a new attribute such as risk-premium to Stock. All 
tuples must be present in the collection. 

7. Alice permits Bob to remove the industry attribute from Stock; that is, Bob 
can publish the relational projection of Stock on ticker and rating, which 
must be complete. Fig. 3(c) satisfies this requirement. 

2.2 Existing Signature Techniques 

While there are techniques which address individual integrity challenges, there 
is no general framework for signing structured data and evaluating the integrity 
of signed data. Here we briefly review existing technologies: conventional digital 
signatures, homomorphic signature schemes, and tree-based query certification 
schemes based on Merkle trees.

Conventional digital signatures. Aside from key generation, a conventional
digital signature scheme consists of two operations, SIGN and VERIFY, which 
we apply to databases. These operations are employed in our basic setting as 
follows: 

Alice : signs relation R by computing signature $S_{Alice} = \mathrm{SIGN}_{Alice}(R)$.

Bob : receives $S_{Alice}$ from Alice. Publishes $S_{Alice}$ and relation $R'$.

Carol : verifies the signature by computing $\mathrm{VERIFY}_{Alice}(S_{Alice}, R')$.

Here $\mathrm{VERIFY}_{Alice}(S_{Alice}, R')$ returns yes if and only if $R = R'$; otherwise it
returns no. Alice’s private key is implicit in $\mathrm{SIGN}_{Alice}$ and her public key is
implicit in $\mathrm{VERIFY}_{Alice}$.

Ideally, it is computationally infeasible to compute a valid signature on a 
database without knowledge of the private key. A common digital signature 
scheme is built using the RSA public-key cryptosystem [17] and a message digest 
like SHA-1 [20]. The output of the message digest on the database (appropriately 
padded [19]) is signed by encrypting it under a private key. A recipient verifies 
a signature by retrieving the author’s public key, using it to decrypt the signa- 
ture, and checking that the result is equal to the padded digest of the database 
purportedly signed. 
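
The hash-then-sign pattern described above can be sketched as follows. This is a toy under loud assumptions, not a usable signature scheme: it uses textbook RSA without the padding mentioned in the text, a key built from two small Mersenne primes, and SHA-256 in place of SHA-1. The `digest`, `sign`, and `verify` helpers and the canonical serialization are hypothetical choices for the example.

```python
import hashlib

# Toy key from two Mersenne primes; far too small and too structured for real use.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest(relation) -> int:
    """Hash a canonical (sorted) serialization of the relation into Z_N."""
    canonical = repr(sorted(relation)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(canonical).digest(), "big") % N


def sign(relation) -> int:
    """Plays the role of SIGN_Alice(R): apply the private exponent to the digest."""
    return pow(digest(relation), D, N)


def verify(signature: int, relation) -> bool:
    """Plays the role of VERIFY_Alice(S, R'): recover the digest and compare."""
    return pow(signature, E, N) == digest(relation)


stock = {("IBM", "buy"), ("MSFT", "hold"), ("WFMI", "hold"), ("JPM", "sell")}
s_alice = sign(stock)
```

Verification succeeds only on an identical relation, which is why a conventional signature over the whole database leaves Bob no room to remove or add tuples.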

Obviously, conventional signatures are restrictive, since they allow Bob to
perform only very limited operations on the data. We illustrate this on our running
example.

Example 1 (Applying conventional signatures).

(a) A conventional signature applied to relation Stock can be used to implement 
Challenge (1). This is the typical use of a digital signature, just described. 

(b) A conventional signature can implement Challenge (2) where Alice wishes 
only to ensure that authorized tuples are provided, but permits deletions. 
A danger exists, however, that Bob could collect tuples signed by Alice at
different times or in different contexts, and mix tuples to construct a collection
in which each tuple verifies. To avoid this, for each tuple t, and given
an identifier r unique for each instance R, Alice can sign the pair (r, t). The
identifier r may simply be a date or timestamp, and Carol must check the
consistency of each r in the tuples she receives from Bob.

(c) A conventional signature can also implement Challenge (4), to support selections
on industry. Alice will compute a view of the database for each
industry. For instance, in Datalog notation, the following views return the
sets of (ticker, rating, industry) triples for each of the Technology, Financial,
and Consumer sectors. (The result of V_1 is pictured in Fig. 3(b).)

V_1(t, r, i) :- Stock(t, r, i), i = 'Technology'

V_2(t, r, i) :- Stock(t, r, i), i = 'Financial'

V_3(t, r, i) :- Stock(t, r, i), i = 'Consumer'

Alice must sign each view definition together with the view result. Bob may
present any of the view results to Carol with a verifiable signature from Alice.
Bob cannot tamper with the tuples in any view result. This technique
is straightforward, but neither efficient nor feasible in general because Alice
must predict which selections or views Carol may need, and generate
signatures for each.
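
The mixing danger in Example 1(b), and Carol's consistency check on the instance identifier r, can be sketched as follows. As a loudly stated assumption, the sketch uses an HMAC as a stand-in for Alice's signature; an HMAC is a symmetric MAC, so unlike a real digital signature Carol would need the secret key to verify. The key, dates, and helper names are hypothetical.

```python
import hashlib
import hmac

ALICE_KEY = b"demo-key-for-illustration-only"   # hypothetical secret


def sign_tuple(instance_id: str, tup: tuple) -> bytes:
    """Stand-in for signing the pair (r, t): a MAC over instance id and tuple."""
    message = repr((instance_id, tup)).encode("utf-8")
    return hmac.new(ALICE_KEY, message, hashlib.sha256).digest()


def publish(instance_id, relation):
    """Alice signs each tuple of one database instance separately."""
    return [(instance_id, t, sign_tuple(instance_id, t)) for t in sorted(relation)]


def verify_collection(signed_tuples) -> bool:
    """Carol's check: all tuples share one instance id and every signature verifies."""
    instance_ids = {r for r, _, _ in signed_tuples}
    if len(instance_ids) > 1:
        return False   # tuples were mixed from different instances
    return all(
        hmac.compare_digest(sig, sign_tuple(r, t)) for r, t, sig in signed_tuples
    )


# Two hypothetical instances of Stock published on different days.
day_one = publish("2004-01-05", {("IBM", "buy", "Technology"), ("JPM", "sell", "Financial")})
day_two = publish("2004-01-06", {("IBM", "sell", "Technology")})

subset_ok = verify_collection(day_one[:1])            # deleting tuples is permitted
mixed_ok = verify_collection(day_one[1:] + day_two)   # mixing instances is rejected
```

Each tuple in the mixed collection carries a valid signature on its own; it is only the disagreement between instance identifiers that lets Carol reject it.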



Homomorphic signatures. Recently, digital signature schemes have been pro- 
posed [18, 12, 9] that allow anyone (i.e. without knowledge of the private key) to 
compute new signatures for certain data values. The new signatures are com- 
puted from signed data items, and can be computed only for data items derived 
in certain limited ways from signed data items. Such a scheme is used to permit 
Bob to modify or extract data signed by Alice, and then compute Alice’s sig- 
nature on the derived data. The signatures on the derived objects are designed 
to be indistinguishable from signatures computed by the private key holder. 
Clearly, the basic security property of a digital signature does not hold for a ho- 
momorphic scheme because certain signatures are easily computed (i.e. forged). 

Homomorphic signature schemes have been proposed for specific operations 
like subset, redaction [9] and transitive closure [12]. We define the homomorphic 
signature scheme for subset, and describe the other two briefly below. 

Alice : signs R by computing $HS_{Alice} = \mathrm{HSIGN}[subset]_{Alice}(R)$.

Bob : receives $HS_{Alice}$ from Alice; uses $HS_{Alice}$ to compute a new signature
$HS'_{Alice} = \mathrm{HSIGN}[subset]_{Alice}(R')$ for any subset $R'$ of $R$ of
his choice.

Carol : verifies the signature by computing $\mathrm{VERIFY}_{Alice}(HS_{Alice}, R')$.

Here $\mathrm{VERIFY}_{Alice}(HS_{Alice}, R')$ returns yes if and only if $R' \subseteq R$. Alice’s
private key is implicit in $\mathrm{HSIGN}[subset]_{Alice}$ and her public key is implicit in
$\mathrm{VERIFY}_{Alice}$. It should be computationally infeasible for Bob to construct new
signatures like $HS'_{Alice}$ for sets that are not subsets of $R$. We omit the actual
description of the subset signature scheme, referring the reader to [9] instead.

Example 2 (Applying homomorphic signatures).

(a) Set operations: the subset signature scheme clearly allows us to address
Challenge (2).

A redaction signature scheme applies primarily to a text. To redact a textual 
data element x means to replace any selection of the characters in x with a fixed 
symbol, say #, thereby hiding the selected portions. A signature scheme that 
permits redaction allows anyone to derive a signature of any redacted version of 
x from the signature of x. Such a scheme is proposed in [9]. For example, given 
a signed text "Dec. 1, 1972", a signature can be computed for the redacted 
version "Dec. 1, 19##". 

A transitive closure signature scheme [12] allows an author to sign nodes and 
edges representing an undirected graph such that anyone can derive a signature 
of an edge between nodes for which there exists a signed path. For example, given 
tuples (a, b), (b, c), (d, e) signed by Alice, Bob can efficiently compute a signature
for (a, c), but cannot compute a signature for (a, d).
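
Which edges Bob is able to derive in the example above can be sketched as follows. The sketch models only the semantics (membership in the transitive closure of the signed undirected edges), not the cryptographic construction of [12]; the union-find helper is an illustrative choice.

```python
def derivable_edges_checker(signed_edges):
    """Return a predicate telling whether an edge (u, v) lies in the transitive
    closure of the signed undirected edges, using a union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in signed_edges:
        parent[find(u)] = find(v)           # merge the two components

    def derivable(u, v):
        # Unknown nodes were never signed, so no edge touching them is derivable.
        return u in parent and v in parent and find(u) == find(v)

    return derivable


can_derive = derivable_edges_checker([("a", "b"), ("b", "c"), ("d", "e")])
```

Nodes a and c end up in the same component, so a signature for (a, c) is derivable, while a and d sit in different components.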

Tree-based signatures for query certification. We illustrate here the main
idea behind the techniques [7, 6, 13] based on Merkle trees [10, 11]. To simplify the
discussion we assume a binary relation R(x,y). A query certification signature 
scheme allows Alice to sign R in such a way that it allows Bob to publish the 
answer to any query q of the form: 

R'(x, y) :- R(x, y), x = a

for an arbitrary constant a. Notice that while $R' \subseteq R$, a subset signature scheme
is not useful here, since R' may not be an arbitrary subset but must consist of 
precisely the tuple(s) with a certain value of x. As we discussed earlier, Alice 
could simply use a conventional signature and sign all possible answers (there 
are no more answers than tuples in R, plus one), but the query certification
technique allows Alice to provide a much shorter signature object. Bob can use 
it to construct a new signature for R' , for any specific a. Formally: 

Alice : signs R by computing $MS_{Alice} = \mathrm{MSIGN}_{Alice}(R)$.

Bob : receives $MS_{Alice}$ from Alice; for any constant a, Bob uses $MS_{Alice}$
to compute a new signature $MS^{q}_{Alice}$ for the answer to the query
R'(x, y) :- R(x, y), x = a.

Carol : verifies the signature by computing $\mathrm{VERIFY}_{Alice}(MS^{q}_{Alice}, R')$.

Here $\mathrm{VERIFY}_{Alice}(MS^{q}_{Alice}, R')$ returns yes if and only if $R'$ is obtained from
$R$ by answering some query of the form R'(x, y) :- R(x, y), x = a.

We briefly illustrate the Merkle tree for our running example in Fig. 4. Alice
uses a collision-resistant hash function $f$ to build a binary tree of hash values as
follows. First, she computes the hash for each tuple $t_i$. Then she pairs these values,
computing the hash of their concatenation and storing it as the parent. She
continues bottom-up, pairing values and hashing their combination until a root
hash value $h_\epsilon$ is formed. Note that $h_\epsilon$ is a hash value that depends on all tuples
in her database. Alice publishes a description of $f$ and $S_{Alice} = \mathrm{SIGN}_{Alice}(h_\epsilon)$.

Bob can now produce a verification object for the query below:

R'(t, r) :- R(t, r), t = 'WFMI'

Bob provides $t_3$ = (WFMI, hold) as an answer and proves the accuracy of this
answer by providing a path of hash values in the tree sufficient to compute $h'_\epsilon$.
In this case, Bob gives Carol $h_{11}$ and $h_0$, in addition to Alice’s original signature
$S_{Alice}$. Carol computes $h_{10} = f(t_3)$ and uses the provided hash values to compute
$h'_\epsilon$, which is verified against the $h_\epsilon$ signed by Alice.
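
The Merkle tree walk for this example can be sketched as follows. This is a minimal sketch assuming SHA-256 as the hash function $f$ and `repr` as the tuple serialization, neither of which is fixed by the text; signing and verifying the root itself is omitted.

```python
import hashlib


def f(data: bytes) -> bytes:
    """Collision-resistant hash f (SHA-256 here, as an assumption)."""
    return hashlib.sha256(data).digest()


def leaf(tup) -> bytes:
    """Hash of a single tuple, using repr as a simple serialization."""
    return f(repr(tup).encode("utf-8"))


tuples = [("IBM", "buy"), ("MSFT", "hold"), ("WFMI", "hold"), ("JPM", "sell")]

# Alice: build the tree bottom-up.
h00, h01, h10, h11 = (leaf(t) for t in tuples)
h0 = f(h00 + h01)
h1 = f(h10 + h11)
h_root = f(h0 + h1)          # the root value that Alice signs

# Bob: answer the query t = 'WFMI' with t_3 and the sibling hashes on its path.
answer = tuples[2]
proof = {"h11": h11, "h0": h0}


# Carol: recompute the root from the answer and the proof.
def recompute_root(answer, proof) -> bytes:
    h10_prime = leaf(answer)
    h1_prime = f(h10_prime + proof["h11"])
    return f(proof["h0"] + h1_prime)


root_matches = recompute_root(answer, proof) == h_root
```

Carol needs only two hash values besides the answer tuple, and any change to the answer yields a different recomputed root.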

This basic technique can be extended to support selection queries and range
queries over non-key attributes, and to some additional types of relational queries [7].

Example 3 (Applying query certification). A tree-based signature scheme can
be used to efficiently implement Challenge (4) where Alice would like to allow
authorized publication of selection queries on industry. She would sort her
database of stock recommendations on industry, and generate a tree-based signature.
Then for any industry ind, Bob can provide a verified answer for query
q(r, t, i) :- R(r, t, i), i = ind. Here Alice can construct the tree-based signature
object without knowledge of the particular queries Carol may ask.

[Figure: Merkle tree with root $h_\epsilon = f(h_0 \,\|\, h_1)$ over the leaves $t_1$ = (IBM, buy),
$t_2$ = (MSFT, hold), $t_3$ = (WFMI, hold), $t_4$ = (JPM, sell).]

Fig. 4. A Merkle tree built over an abbreviated version of the relation Stock. The root
value $h_\epsilon$ is computed bottom-up by repeated applications of $f$, a collision-resistant
hash function. Concatenation is denoted ||.


3 Modeling Signatures Using Constraints 

Recall that Carol uses the mediated data R' received from Bob, in place of the 
original data R authored by Alice. A verified signature on R' provides Carol with 
a guarantee about a certain relationship between R and R' . We now explain how 
logical constraints can be used to formalize these guarantees. To do so, we will 
work with statements of the following form: 

$\mathrm{VERIFY}(S_{Alice}, R') \models$ constraint-expression

$\mathrm{VERIFY}(HS_{Alice}, R') \models$ constraint-expression

$\mathrm{VERIFY}(MS^{q}_{Alice}, R') \models$ constraint-expression

On the left-hand side we have a verification operation (conventional, homomorphic,
or tree-based) performed by Carol on a signature object and data instance
R' received from Bob. On the right is a constraint expression referring to R and
R'. The meaning of a statement is that successful verification proves, relative to
the security assumptions of the signature scheme, that the constraint expression holds.

In the following subsections we describe a language for constraint expressions,
provide correct constraint expressions for the signature schemes described
above, and finally explain how to choose a signature scheme to enforce a desired
constraint.

3.1 Constraints

We express a constraint as a logical formula called an embedded dependency [1],
having the following form (see footnote 2):

$\forall x_1 \ldots \forall x_n\, [\exists y_1 \ldots \exists y_m\, \varphi(x_1, \ldots, x_n, y_1, \ldots, y_m) \rightarrow \exists z_1 \ldots \exists z_k\, \psi(x_1, \ldots, x_n, z_1, \ldots, z_k)]$

where both $\varphi$ and $\psi$ are conjunctions of positive relational atoms and
equality/inequality predicates, and each of them uses all the variables $x_1, \ldots, x_n$.

Constraints are a fundamental topic in databases, which are used to define 
properties that must hold for all database instances. In the relational model they 
are most commonly used to express key and foreign key relationships. A broad 
theory of constraints has been developed over time. Two of the most important 
theoretical problems are inference (deciding whether a new constraint is implied 
by existing constraints), and query optimization (improving query execution 
using constraints known to hold over the data). 

An example of a basic embedded dependency expresses completeness by asserting
that $R'$ contains every tuple in $R$, i.e. $R \subseteq R'$:

$c_c$ : $\forall x_1 \ldots \forall x_n\, [R(x_1, \ldots, x_n) \rightarrow R'(x_1, \ldots, x_n)]$    (completeness)

If every tuple of $R'$ is present in $R$, i.e. $R' \subseteq R$, then we write the constraint:

$c_s$ : $\forall x_1 \ldots \forall x_n\, [R'(x_1, \ldots, x_n) \rightarrow R(x_1, \ldots, x_n)]$    (soundness)

We define $C_{eq}$ to be the set of constraints $\{c_c, c_s\}$ and note that $C_{eq}$ holds iff
$R = R'$.
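
Over finite instances, the completeness and soundness constraints reduce to containment checks, which can be sketched as follows. The function names and the sample tuples are hypothetical; relations are modeled as Python sets of tuples.

```python
def completeness(R, R_prime) -> bool:
    """Completeness constraint: every tuple of R appears in R'."""
    return all(t in R_prime for t in R)


def soundness(R, R_prime) -> bool:
    """Soundness constraint: every tuple of R' appears in R."""
    return all(t in R for t in R_prime)


def c_eq(R, R_prime) -> bool:
    """Both constraints together hold exactly when R = R'."""
    return completeness(R, R_prime) and soundness(R, R_prime)


R = {("IBM", "buy"), ("MSFT", "hold"), ("JPM", "sell")}
subset = {("IBM", "buy")}              # sound but not complete
superset = R | {("XYZ", "buy")}        # complete but not sound
```

A published subset satisfies soundness only, a superset satisfies completeness only, and both hold together only for an unmodified copy.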

3.2 Constraints Enforced by Signatures 

We now review each of the signature schemes described in Section 2 and formalize 
their security guarantees with constraint expressions. 

Conventional signatures. If Alice uses a conventional signature to sign a
relation R, that signature will verify on R' if and only if R = R'. Thus the
following statement holds:

$\mathrm{VERIFY}(S_{Alice}, R') \models C_{eq}$

Suppose, as in Example 1(c), that Alice signs the view:

V_1(t, r, i) :- Stock(t, r, i), i = 'Tech'

over Stock so that $S_{Alice} = \mathrm{SIGN}_{Alice}(V_1(\mathrm{Stock}))$. Then the following statement
holds:

$\mathrm{VERIFY}(S_{Alice}, R') \models \forall t \forall r\, [\mathrm{Stock}(t, r, \text{'Tech'}) \rightarrow R'(t, r, \text{'Tech'})] \;\wedge$

$\forall t \forall r \forall i\, [R'(t, r, i) \rightarrow \mathrm{Stock}(t, r, i) \wedge (i = \text{'Tech'})]$

Footnote 2. The variables $y_1, \ldots, y_m$ are not technically needed, but are convenient
for the examples in this section.

Homomorphic signatures. Signing R as a subset of tuples with a homomorphic
signature scheme supporting subsets, we have $HS_{Alice} = \mathrm{HSIGN}[subset]_{Alice}(R)$
and the following statement holds:

$\mathrm{VERIFY}(HS_{Alice}, R') \models \forall x_1 \ldots \forall x_n\, [R'(x_1, \ldots, x_n) \rightarrow R(x_1, \ldots, x_n)]$

To model the redaction signature, we represent a document as a binary relation
R(x, y), where the first attribute is a position number and the second is a character.
For example a text document like “This is a message...” is represented
as $R = \{(1, \text{'T'}), (2, \text{'h'}), (3, \text{'i'}), (4, \text{'s'}), \ldots\}$. Then the redaction signature scheme
enforces the following constraint:

$\mathrm{VERIFY}(HS_{Alice}, R') \models \forall x\, [\exists y\, R(x, y) \rightarrow \exists y'\, R'(x, y')] \;\wedge$

$\forall x\, [\exists y'\, R'(x, y') \rightarrow \exists y\, R(x, y)] \;\wedge$

$\forall x \forall y \forall y'\, [R(x, y) \wedge R'(x, y') \wedge y \neq y' \rightarrow y' = \text{'\#'}]$
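
The redaction constraint can be checked on small documents as sketched below: both relations must cover the same positions, and wherever the characters differ the received character must be the redaction symbol. The helper names are hypothetical, and the sketch assumes each position holds exactly one character so that a relation can be read as a dictionary.

```python
def as_relation(text: str):
    """Represent a document as a set of (position, character) pairs, positions from 1."""
    return {(i, ch) for i, ch in enumerate(text, start=1)}


def redaction_constraint(R, R_prime, symbol="#") -> bool:
    """Check the three conjuncts of the redaction constraint on finite relations."""
    original = dict(R)        # position -> character (assumes one character per position)
    received = dict(R_prime)
    # Conjuncts 1 and 2: R and R' contain exactly the same positions.
    same_positions = original.keys() == received.keys()
    # Conjunct 3: a differing character in R' can only be the redaction symbol.
    only_masked = all(
        received[x] == original[x] or received[x] == symbol
        for x in original.keys() & received.keys()
    )
    return same_positions and only_masked


R = as_relation("Dec. 1, 1972")
```

The redacted date from the earlier example satisfies the constraint, while altering a digit or truncating the text violates it.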

Tree-based signatures. Let $MS_{Alice} = \mathrm{MSIGN}_{Alice}(R)$ be a tree-based signature
on R, and let $MS^{q}_{Alice}$ be the certification object. As before, we assume R
to be a binary table, to simplify our discussion.

$\mathrm{VERIFY}_{Alice}(MS^{q}_{Alice}, R') \models \forall x \forall y\, (R'(x, y) \rightarrow R(x, y)) \;\wedge$

$\forall x \forall y \forall y'\, (R'(x, y) \wedge R(x, y') \rightarrow R'(x, y'))$

By writing the constraint this way we do not enforce that R'(x, y) contain a
single value a on the x position, but we enforce the fact that whenever it contains
some tuple (a, b) then it contains all tuples where x = a. For example if
$R = \{(a_1, b), (a_2, c), (a_2, d), (a_3, e)\}$ then the constraint holds for $R' = \{(a_2, c), (a_2, d)\}$
and it holds similarly for $R' = \{(a_1, b), (a_2, c), (a_2, d)\}$. But the constraint does
not hold for $R' = \{(a_1, b), (a_2, c)\}$. One could write a more complex constraint
that requires a single value for x in R', but this is unnecessary since Carol can
check it herself by examining all tuples in R'.
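
The three cases in the example above can be checked mechanically with the following sketch. The function name is hypothetical, and the check is a direct finite-instance reading of the two conjuncts: R' contains only tuples of R, and for every x value occurring in R' it contains all tuples of R with that x value.

```python
def tree_constraint(R, R_prime) -> bool:
    """Check the tree-based signature constraint on finite binary relations."""
    # First conjunct: every tuple of R' is a tuple of R.
    sound = all(t in R for t in R_prime)
    # Second conjunct: for each x occurring in R', all R-tuples with that x are in R'.
    xs_in_answer = {x for x, _ in R_prime}
    complete_per_x = all((x, y) in R_prime for x, y in R if x in xs_in_answer)
    return sound and complete_per_x


R = {("a1", "b"), ("a2", "c"), ("a2", "d"), ("a3", "e")}
case_single_x = tree_constraint(R, {("a2", "c"), ("a2", "d")})
case_two_xs = tree_constraint(R, {("a1", "b"), ("a2", "c"), ("a2", "d")})
case_partial = tree_constraint(R, {("a1", "b"), ("a2", "c")})
```

The first two instances satisfy the constraint and the third does not, because it omits one of the tuples with the second x value.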

By abuse of notation we write this constraint as: 

$\mathrm{VERIFY}_{Alice}(MS^{q}_{Alice}, R') \models q(R) = R'$

where q is the query R'(x, y) :- R(x, y), x = a. Notice that the query needs to
specify a certain constant a, while the actual constraint is independent of any 
constant. 

A subtle but important aspect of this formalism is that it allows us to express 
precisely what Carol knows about the original source. For example, suppose that 
Carol asks Bob two different queries, $q_1$ and $q_2$, and Bob provides her with two
answers $R_1$ and $R_2$ plus two certification objects $MS^{q_1}_{Alice}$ and $MS^{q_2}_{Alice}$. Then
Carol can verify that these come from the same database signed by Alice, i.e.:

$\mathrm{VERIFY}_{Alice}(MS^{q_1}_{Alice}, R_1) \wedge \mathrm{VERIFY}_{Alice}(MS^{q_2}_{Alice}, R_2) \models q_1(R) = R_1 \wedge q_2(R) = R_2$

This is because Carol can check that the two verification objects carry the same
original signature by Alice, hence they refer to the same instance of the database.

3.3 Constraints for Other Integrity Challenges

The remaining challenges from Sec. 2 are listed below. We express each as a 
constraint. 

(3) Alice requires all tuples of Stock to be present, but Bob may add additional
tuples. This is the completeness constraint:

$c_c$ : $\forall x_1 \ldots \forall x_n\, [R(x_1, \ldots, x_n) \rightarrow R'(x_1, \ldots, x_n)]$

A conventional signature can enforce the combined constraints $\{c_c, c_s\}$, but
there is a subtlety in enforcing $c_c$ alone. Although it is not difficult to enforce
that each of Alice’s tuples is present in any collection Bob publishes, those
tuples would have to be distinguished from tuples later added by Bob. It is
not clear how to enforce this challenge while hiding this distinction from
Carol.

(5) Alice permits Bob to update rating, but he cannot modify other columns. 
All tuples must be present in the collection. 

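$\forall t \forall r \forall i\, [\mathrm{Stock}(t, r, i) \rightarrow \exists r'\, R'(t, r', i)]$

$\forall t \forall r' \forall i\, [R'(t, r', i) \rightarrow \exists r\, \mathrm{Stock}(t, r, i)]$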

(6) Alice permits Bob to add a new column such as risk premium to Stock. 
All tuples must be present in the collection. 

$\forall t \forall r \forall i\, [\mathrm{Stock}(t, r, i) \rightarrow \exists z\, R'(t, r, i, z)]$

$\forall t \forall r \forall i \forall z\, [R'(t, r, i, z) \rightarrow \mathrm{Stock}(t, r, i)]$

(7) Alice permits Bob to remove industry from Stock; that is, Bob can publish 
the projection of Stock on ticker and rating, which must be complete. 

$\forall t \forall r \forall i\, [\mathrm{Stock}(t, r, i) \rightarrow R'(t, r)]$

$\forall t \forall r\, [R'(t, r) \rightarrow \exists i\, \mathrm{Stock}(t, r, i)]$
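
On finite instances, the constraint pairs for Challenges (5), (6), and (7) each reduce to an equality between projections, which can be sketched as follows. The function names and the sample tuples are hypothetical; the checks say nothing about which signature scheme would enforce them.

```python
def allows_rating_update(stock, r_prime) -> bool:
    """Challenge (5): the (ticker, industry) pairs agree; the rating is unconstrained."""
    return {(t, i) for t, _, i in stock} == {(t, i) for t, _, i in r_prime}


def allows_added_column(stock, r_prime) -> bool:
    """Challenge (6): dropping the extra column from R' gives exactly Stock."""
    return {(t, r, i) for t, r, i, _ in r_prime} == set(stock)


def allows_projection(stock, r_prime) -> bool:
    """Challenge (7): R' is exactly the projection of Stock on (ticker, rating)."""
    return {(t, r) for t, r, _ in stock} == set(r_prime)


stock = {("IBM", "buy", "Technology"), ("JPM", "sell", "Financial")}

updated = {("IBM", "hold", "Technology"), ("JPM", "sell", "Financial")}
retagged = {("IBM", "buy", "Consumer"), ("JPM", "sell", "Financial")}
extended = {("IBM", "buy", "Technology", 0.1), ("JPM", "sell", "Financial", 0.3)}
projected = {("IBM", "buy"), ("JPM", "sell")}
```

Changing a rating passes the Challenge (5) check while changing an industry does not, and an incomplete projection fails the Challenge (7) check.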

4 Applying the Formal Model 

Signatures modeled with constraints are a critical tool that Carol can use to 
evaluate the integrity of the data R' provided by Bob. In data exchange scenarios, 
Carol usually needs to access the data in terms of queries. We consider two such 
scenarios. First, provided with R' , Carol may have in mind a query q' over R' . In 
order to assess the integrity of the query answer q'(R'), she would like to relate 
it to R, and she does so by using the constraints provided by the signatures. She 
characterizes the integrity of the result by computing the query q over R such 
that the query results match: 

Problem 1 (Characterizing integrity of query result). Given Alice’s signature
object $S_{Alice}$ and a query $q'$ over $R'$, what is a query $q$ over $R$ such that
$\mathrm{VERIFY}_{Alice}(S_{Alice}, R') \models q(R) = q'(R')$?

A dual problem results if we suppose Carol’s goal is to answer a query q over
R. She must use R' to answer q and to do so, she must understand the impact
of the constraints that hold between R and R'. We formalize this answerability
problem as follows:

Problem 2 (Exact answerability over signed data). Given Alice’s signature object
$S_{Alice}$ and a query $q$ over $R$, does there exist a query $q'$ over $R'$ such that
$\mathrm{VERIFY}_{Alice}(S_{Alice}, R') \models q'(R') = q(R)$?

Common to both problems is a basic decision problem: 

Problem 3 (Equivalence decision problem). Given Alice’s signature object $S_{Alice}$,
a query $q$ over $R$, and a query $q'$ over $R'$, decide whether
$\mathrm{VERIFY}_{Alice}(S_{Alice}, R') \models q(R) = q'(R')$.

These problems seem nearly impossible to solve without reference to the for- 
mal model we have presented. However, if we invoke the constraint statements of 
Sec. 3 then it is clear that Carol must verify each signature object, and collect the 
implied constraints into a set of constraints C. We can then use well-understood 
techniques of query answerability [8] or chase/backchase [15] to solve these problems.
We illustrate here with one special case, which is a direct application of a
result in [15]. 

Theorem 1. Let C be the embedded dependencies enforced by Alice’s signature
$S_{Alice}$. For a query $q$, let $\mathit{chase}_C(q)$ denote the result of applying the chase
technique of [15], if it exists. Then:

1. Suppose $\mathit{chase}_C(q')$ exists. Then one can solve Problem 1 by performing a
backchase on $\mathit{chase}_C(q')$. This problem is NP-complete.

2. Suppose $\mathit{chase}_C(q)$ exists. Then one can solve Problem 2 by performing a
backchase on $\mathit{chase}_C(q)$. This problem is NP-complete.

3. Suppose $\mathit{chase}_C(q)$ and $\mathit{chase}_C(q')$ exist. Then one can solve Problem 3
by checking the query equivalence $\mathit{chase}_C(q) = \mathit{chase}_C(q')$. This problem is
NP-complete.

5 Related Work 

Throughout the paper we have referred to the homomorphic signature schemes 
[18,12,9] and tree-based signature schemes [10,11,7,6] on which this work de- 
pends. The authors of [4] use the W3C XML Signature as a format for im- 
plementing “content extraction signatures” which allow an author to sign a 
document along with a definition of permissible operations of blinding (simi- 
lar to redaction) and extraction. An authorized recipient can blind or extract 
the document and generate a signature without contacting the author. However, 
verification of the signature by a third-party requires contacting the author who 
will verify the extractions were legal and verify the new signature. The authors 
of [3] propose a framework of cooperative updates to a document which are 
controlled according to confidentiality and integrity processes. The drawback of 
their scheme is that the flow of the document through a sequence of collaborating
parties must be predetermined by the first author. 

The BAN logic is a formalism for reasoning about the beliefs of parties in 
a cryptographic protocol [5]. The model captures parties’ knowledge and beliefs
and how they evolve over time as the result of communication. To the best of 
our knowledge, ours is the first attempt to formally model signatures applied 
to structured data, and in particular to relate the semantics of signatures to 
traditional database constraints. 

6 Conclusion and Future Work 

Conclusion. Conventional digital signatures and recent extensions to signature 
techniques are a promising tool for providing integrity in data exchange when 
enforcing conventional access control is not possible. We have introduced a formal 
model for signature semantics based on relational constraints. This model can 
guide the choice of signatures, and is the basis for evaluating queries over signed 
data. In addition, our model unifies disparate signature techniques, providing 
insight into their application, combination and their distinguishing features. 

Future work. It is clear much work remains to complete a practical, expressive 
formalization of signature semantics. In particular, the best data model for ex- 
change scenarios is likely to be semi-structured. Some of the signature techniques 
above can be applied to XML, and it remains to extend our present insights to a 
constraint language over XML like that proposed in [15]. The existing signature 
techniques we have described provide a basic set of primitives. To support real 
data exchange scenarios, new signature techniques need to be developed. 




Fig. 5. Integrity for data integrated from multiple parties. 



Furthermore, throughout the discussion we have concentrated on managing 
the integrity of data authored by Alice alone. Our eventual goal is to characterize 
the integrity of data authored by multiple parties and integrated by Bob. Such 
a scenario is pictured in Fig. 5 where distinct data sources publish databases
$R_1$, $R_2$, and $R_3$, and Bob publishes $Q(R_1, R_2, R_3)$ along with some signatures.
Here the challenge is for Carol to verify the integrity of the integrated results 
offered by Bob. A formal model of signatures is a basic prerequisite to tackling 
this generalization of the single-author case.

Abstract. The recent investigation of privacy-preserving data mining
and other kinds of privacy-preserving distributed computation has been 
motivated by the growing concern about the privacy of individuals when 
their data is stored, aggregated, and mined for information. Building 
on the study of selective private function evaluation and the efforts to- 
wards practical algorithms for privacy-preserving data mining solutions, 
we analyze and implement solutions to an important primitive, that of 
computing statistics of selected data in a remote database in a privacy- 
preserving manner. We examine solutions in different scenarios ranging 
from a high speed communications medium, such as a LAN or high- 
speed Internet connection, to a decelerated communications medium to 
account for worst-case communication delays such as might be provided 
in a wireless multihop setting. 

Our experimental results show that in the absence of special-purpose 
hardware accelerators or practical optimizations, the computational com- 
plexity is the performance bottleneck of these solutions rather than the 
communication complexity. We also evaluate several practical optimiza- 
tions to amortize the computation time and to improve the practical 
efficiency. 



1 Introduction 

Privacy-preserving data mining, as well as other kinds of privacy-preserving 
distributed computation, is intended to address conflicting goals. On the one 
hand, it is often desirable to extract information from collected data. On the 
other hand, there are often legitimate concerns about the privacy of personal 
data, proprietary data, and other sensitive information. Privacy-preserving data 
mining, in which certain computations are allowed, while other information is 
to remain protected, was first introduced in 2000 by Agrawal and Srikant [2] 
and Lindell and Pinkas [13]. Since then, extensive research has been devoted to
privacy-preserving data mining and other privacy-preserving computations effi- 
cient enough to be used on extremely large data sets (e.g., [3, 9, 5, 8, 17, 12, 7, 18, 
10,1,19]). 

In general, this research has been divided into solutions that provide strong 
cryptographic privacy protection, which require more computational overhead 
and have so far been limited to extremely simple (but useful) functions, and 
those that use perturbation, which provide weaker privacy properties, but allow 
much more efficient solutions and allow computation of more sophisticated data 
mining functions. 

Our work provides an experimental evaluation of a cryptographic solution 
presented by the second author and others [5]. They introduced selective private
function evaluation, a general methodology for efficient privacy-preserving solu- 
tions of computations by a client over data in a remote database. Their general 
solutions can provide efficiency improvements whenever the number of data ele- 
ments involved in the computation is significantly fewer than the total number 
of data elements. As a particular instance, they consider a client/server envi- 
ronment in which the client and the server engage in a secure computation to 
evaluate a statistical function. Their solutions provide strong privacy guarantees, 
and involve encryption as a primary component. 

As a specific selective private function computation, they consider private 
sum computation. In this setting, a client privately performs a sum or weighted 
sum of selected database elements held by the server. This is an important ex- 
ample because such protocols immediately yield private solutions for computing 
means, variances, and weighted averages, which can be useful on their own or 
as part of a larger privacy-preserving distributed data mining protocol. In our 
work, we implement a particular privacy-preserving solution to the private sum 
computation [6]; this protocol is described in more detail below. This protocol, 
as well as some of the others of Canetti et al. [5], can easily be extended to work
for multiple distributed databases. 

Our results show that the total running time needed is quite high, but it 
becomes feasible if certain straightforward optimizations are done, such as some 
client precomputation before the actual computation is to be done. Unless spe- 
cial hardware accelerators or practical optimizations are used, the computational 
delay caused by the encryption operations is the bottleneck, while the commu- 
nication delay is significantly less. 

To our knowledge, our implementation is one of the first implementations 
of privacy-preserving database computations. Relatedly, Malkhi et al.’s recent 
implementation [14] of Yao’s general secure two-party computation solution [20] 
provides the first general secure multiparty computation results, and demon- 
strates that many computations on relatively small data sets can be done ex- 
tremely efficiently. Indeed, secure multiparty computation and cryptographically 
strong privacy-preserving database computations, largely considered only theo- 
retical, seem to be on the cusp of practicality as both theoretical and techno- 
logical advances have improved their performance. Therefore, this kind of initial 




Experimental Analysis of Privacy-Preserving Statistics Computation 






experimental work is an important contribution to understanding where such 
results are within the realm of practice and where further improvements are still 
needed. 

In Section 2, we describe the private selected sum problem and our imple- 
mented solution in more detail. We present our experimental results, including 
various practical optimizations that reduce the execution time, in Section 3. 

2 Private Selected Sum Computation 

We consider the simple problem of privately evaluating the sum of a subset of 
numbers. The server holds a database of n numbers. The client is interested in 
the sum of m selected numbers in the database (whose indices it is assumed to 
know, e.g., from some publicly available source), but the client does not wish 
to reveal its selection criteria. The database owner on the other hand wants to 
reveal to the client only the sum and not the individual elements that contribute 
to the sum. 

A privacy-preserving client/server computation must satisfy three require- 
ments [5]. Correctness states that as long as the client and server follow the 
protocol then the client’s output is the correct value. Client Privacy requires 
that a malicious server cannot learn anything from the interaction about which 
values the client has selected to be involved in the computation. Database Pri- 
vacy requires that the client learn only a predefined amount of information about 
the data. 

A trivial but nonprivate solution to this problem is to let the client send 
the m indices in which it is interested to the database server. The server then
computes the sum of the values at the specified indices and returns the sum 
to the client. While this solution preserves the privacy of the server, the server 
learns the set of indices the client is interested in, thus compromising the client 
privacy requirement. Conversely, another alternative would be for the server to 
expose the database to the client and have the client compute the sum of the 
numbers it is interested in. In this solution, the client’s privacy is preserved but 
the client learns the entire contents of the server’s database, and hence the goal 
of database privacy is not met. 

Secure multiparty computation (SMC) is a powerful cryptographic primitive 
in which two or more parties can jointly compute a specified function of their 
input while hiding their inputs from one another. The problem of securely eval- 
uating the selected sum is a specific example of SMC: the client and server wish 
to jointly evaluate the sum of a selected subset of numbers without the server 
revealing the individual elements or the client revealing the indices of interest. 
General SMC solutions [4,11,20] can provide solutions to the database sum 
problem providing both client and database privacy, but these solutions have 
communication overhead that is at least quadratic in the size of the database, 
which will generally be impractical for large databases. For example, initial re- 
sults of the Fairplay system [14] suggest that straightforward implementation 
of Yao’s solution would require an execution time of at least 15 minutes for a 
database of only 10,000 elements [16].

Canetti et al. [5] present cryptographic privacy-preserving solutions that in
particular focus on reducing the communication. This focus is justified because 
strong privacy requires at least linear computation, as at a minimum every 
data element must be accessed in order to avoid leaking any information to the 
server. They present both linear-communication and sublinear-communication 
solutions. 

As a starting point for our investigations of the practical performance of se- 
lective private function evaluation, we investigate a simple linear-communication 
solution that provides database privacy and client privacy using semantically secure
homomorphic encryption [6]. Semantic security means that ciphertexts yield
no information about their plaintexts. (In particular, encryption is randomized, 
and it is not possible to tell from two ciphertexts whether they encrypt the same 
plaintext or different plaintexts.) A homomorphic encryption scheme is an en- 
cryption scheme in which certain efficient computations on ciphertexts, which 
can be computed without knowledge of the plaintexts or the secret key, cor- 
respond to certain computations on plaintexts. For our protocol, we require a 
homomorphic encryption scheme satisfying $E(a) \cdot E(b) = E(a + b)$, where $\cdot$ and
$+$ denote modular multiplication and addition, respectively. It also follows that
$E(a)^c = E(a \cdot c)$ for $c \in \mathbb{N}$. The Paillier cryptosystem [15] satisfies this property
and is the cryptosystem of our choice in our implementation. 
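
The additive property can be checked with a toy Paillier instance. The sketch below is not the implementation evaluated in this paper (which used Java and C++ with 512-bit keys); it assumes the common simplification g = n + 1, uses two small Mersenne primes chosen purely for illustration, and the names `encrypt`, `decrypt`, `N`, and `N2` are hypothetical. Ciphertexts are multiplied modulo n squared, which presumably plays the role of the modulus M mentioned in the text.

```python
import math
import secrets

# Toy parameters: two well-known Mersenne primes. Far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                   # public modulus n
N2 = N * N                  # ciphertexts live modulo n^2
PHI = (P - 1) * (Q - 1)     # private: phi(n)
MU = pow(PHI, -1, N)        # private: phi(n)^-1 mod n (needs Python 3.8+)


def encrypt(m):
    """Randomized encryption with g = n + 1: c = (1 + m*n) * r^n mod n^2."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Recover m from c using the private values PHI and MU."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


a, b, c = 12, 30, 5
# Multiplying ciphertexts adds the plaintexts: E(a) * E(b) = E(a + b).
sum_plain = decrypt(encrypt(a) * encrypt(b) % N2)
# Raising a ciphertext to a constant multiplies the plaintext: E(a)^c = E(a * c).
scaled_plain = decrypt(pow(encrypt(a), c, N2))
```

Because `encrypt` draws fresh randomness on every call, two encryptions of the same plaintext are different ciphertexts with overwhelming probability, which is the randomization that semantic security requires.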

In the database sum setting, the server holds a database of n numbers 
$x_1, \ldots, x_n$. The client holds an index vector $I_1, \ldots, I_n$ that represents the
subset of numbers it is interested in. That is, $I_i$ is 1 if $x_i$ is to be included
in the sum computation, and 0 otherwise. (If desired, integer weights in some
larger range could be used to produce a weighted sum, which in turn could be 
used for a weighted average.) The client has a public encryption key E and the 
corresponding private decryption key D of a homomorphic encryption scheme.

Fig. 1. Selected Sum Protocol



The private protocol, illustrated in Figure 1, executes as follows. The client 
encrypts its array of indices using the homomorphic cryptosystem and sends the 
encryptions $E(I_1), E(I_2), \ldots, E(I_n)$ to the server. The server then computes the
product $\prod_{i=1}^{n} E(I_i)^{x_i}$. That is, the server takes the $i$th received encrypted value
and raises it to the value of its $i$th data element $x_i$. Then the server multiplies
all these values together modulo M, where M is a parameter of the encryption
scheme. Note that this operation is applied directly to the received encrypted
values, and does not require decryption nor does it yield any information about
the cleartexts to the server. By the properties of homomorphic encryption, the
resulting product is an encryption of the sum of the numbers in the locations
specified by the client’s indices; that is,

$$\prod_{i=1}^{n} E(I_i)^{x_i} = E\left(\sum_{i=1}^{n} I_i x_i\right)$$

as desired. The server sends the product to the client, which decrypts it using the 
private key D to learn the desired sum. All operations are performed modulo M, 
where M is a parameter of the homomorphic encryption cryptosystem used. The 
client’s privacy is protected by the encryption of the indices, while the database’s 
privacy is protected because the result sent back is the encryption of the desired 
sum, and does not contain any information about the other database values. 
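
The three protocol steps can be traced end to end in a short Python sketch. It is a minimal illustration under stated assumptions, not the code measured in Section 3: it uses a toy Paillier instance with g = n + 1 and small Mersenne primes, and the function names (`client_request`, `server_respond`, `client_finish`) and the sample database are hypothetical.

```python
import math
import secrets

# Toy Paillier parameters (illustrative only; real keys are much larger).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)        # needs Python 3.8+


def encrypt(m):
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(c):
    return (pow(c, PHI, N2) - 1) // N * MU % N


def client_request(selection):
    """Client: encrypt every entry of the 0/1 index vector."""
    return [encrypt(bit) for bit in selection]


def server_respond(encrypted_bits, database):
    """Server: compute the product of E(I_i)^(x_i) without decrypting anything."""
    product = 1
    for ciphertext, x in zip(encrypted_bits, database):
        product = product * pow(ciphertext, x, N2) % N2
    return product


def client_finish(response):
    """Client: decrypt the single returned ciphertext to obtain the sum."""
    return decrypt(response)


# Hypothetical data: the client selects positions 0, 2, and 4.
database = [10, 20, 30, 40, 50]
selection = [1, 0, 1, 0, 1]
selected_sum = client_finish(server_respond(client_request(selection), database))
```

The server touches every ciphertext exactly once, so its work is linear in n, and the client decrypts a single value regardless of the database size. Replacing the 0/1 entries by small integer weights gives the weighted sum mentioned above.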

3 Experimental Results 

We implemented the client/server protocol shown in Figure 1 and measured the 
computation and communication performance. We implemented the protocol in 
Java and C++. The Java version uses the Java security package to perform cryptographic
operations and the C++ implementation uses the OpenSSL libraries.
Cryptographic keys are 512 bits. We experimented across various database sizes 
from 10,000 numbers to 100,000 numbers, with numbers of 32 bits each. On 
average, the performance results from our Java experiments were around five 
times slower than those of similar C++ experiments; except in Section 3.5, we
report only the C++ numbers here. 

The experimental data was measured on a High Performance Cluster at 
Stevens Institute of Technology in Hoboken, NJ and on a High Performance 
Cluster at Illinois Institute of Technology in Chicago, IL to measure communi- 
cation complexity over short and long distances, respectively. Communication 
between the client and server was enabled by a 64Gbps switch within the High 
Performance Computing facility at Stevens; communication between the client 
in Chicago and the server in Hoboken used a 56Kbps modem. Our results show 
that despite the longer distance between the client and server and the decelerated 
communication medium, computation time still prevails over the communication 
time, accounting for the bulk of the total running time. 

3.1 Performance Results Without Any Optimizations 

Figures 2 and 3 show experimental results of the direct implementation of the 
solution described in Section 2, without any optimizations. 

Fig. 2. Components of Overall Runtime without Any Optimizations over a Short Distance

Fig. 3. Components of Overall Runtime without Any Optimizations Measured over a Long Distance

In Figure 2, both the client and the server processes ran on 2GHz Pentium-III
processors with 3GB memory, connected by a high-performance gigabit network
switch. Our results illustrate linear time performance, as expected. In this case,
the bulk of the execution time is attributable to the client computation of the n 
public key encryptions of its index vector. The time for the server’s computation 
is significantly less, followed by the communication time. The client’s decryption 
time is constant (independent of the database size) and negligible since it is 
simply the time taken to decrypt a single encryption (of the desired sum). For 
a database of 100,000 elements, approximately 20 minutes is required for the 
execution. 

Figure 3 shows the results of the experiment carried out over a long dis- 
tance. In these experiments, the client process ran on a 500 MHz UltraSparc 
processor machine in Chicago, IL, and the server ran on a 1GHz Intel Pentium 
processor in Hoboken, NJ. Communication between client and server was via a 
56Kbps dialup connection. As before, the client’s encryption time increases linearly
with increase in database size, as does the server’s computation time and
the communication time. As expected, the server’s communication time now becomes
a more substantial part of the execution time. However, despite the slow
communication rate, the computation delay remains more significant than the
communication delay.

Fig. 4. Comparison of Overall Runtimes with and without Batching of Index Vector over a Short Distance

Our results show that in the absence of any practical optimizations or specialized
hardware to accelerate client encryption, computation time is the bottleneck
for the algorithm’s performance. In Sections 3.2-3.5, we evaluate several
straightforward practical optimizations. 

3.2 Single-Pass and Pipeline Parallelism 

Noting that both the client computation and the server computation can be done 
in a single pass through their inputs, we implemented “batching” of the client 
processing, in which the client batches its processing of indices into smaller sized 
chunks, performing and sending the encryptions of the indices in each chunk 
before proceeding to the next chunk. On receiving each chunk, the server can 
continue computing the partial product. 

In addition to taking advantage of pipeline parallelism, this approach also 
reduces the memory requirements of both the client and server. At any point 
in time, the client has to allocate memory needed to hold only one chunk of its 
indices rather than the whole index vector. Similarly, the server need only hold 
a single database chunk in memory at one time. The optimal chunk size will 
depend on the relative communication and computation speeds, as well as the 
overhead in processing messages and memory access. In order to achieve max- 
imum parallelization, ideally all three activities (communication of one batch, 
client processing of the next batch, and server processing of the previous batch) 
will require approximately the same amount of time. 

Figure 4 compares the overall runtime of the protocol with and without batch- 
ing of index vector. In our experiments, we took a batch size of 100 elements, 
resulting in approximately a 10% reduction in overall runtime. 
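
The single-pass structure behind batching can be sketched with Python generators. This is an illustration of the control flow only, under loud assumptions: `placeholder_encrypt` is a stand-in that provides no secrecy at all, the modulus is an arbitrary prime, and all names are hypothetical. The point is that the server can fold each arriving chunk into a running product and never needs the whole encrypted index vector in memory.

```python
from itertools import islice


def encrypted_chunks(selection, chunk_size, encrypt):
    """Client side: lazily encrypt and yield one chunk of index bits at a time."""
    bits = iter(selection)
    while True:
        chunk = [encrypt(bit) for bit in islice(bits, chunk_size)]
        if not chunk:
            return
        yield chunk


def server_accumulate(chunks, database, modulus):
    """Server side: fold each received chunk into the running partial product."""
    product = 1
    values = iter(database)          # the database is also read in a single pass
    for chunk in chunks:
        for ciphertext in chunk:
            product = product * pow(ciphertext, next(values), modulus) % modulus
    return product


def placeholder_encrypt(bit):
    """NOT encryption: a deterministic stand-in used only to exercise the pipeline."""
    return 3 + 2 * bit


MODULUS = 1_000_003                  # arbitrary prime, assumed for the demo
database = [i % 97 for i in range(1000)]
selection = [i % 2 for i in range(1000)]

batched = server_accumulate(
    encrypted_chunks(selection, 100, placeholder_encrypt), database, MODULUS)
unbatched = server_accumulate(
    [[placeholder_encrypt(bit) for bit in selection]], database, MODULUS)
```

Both calls yield the same product, since batching changes only when the factors are multiplied, not which factors are used. In a networked setting the generator would be replaced by messages arriving from the client, so that encryption, transfer, and server exponentiation of different chunks overlap in time.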

Fig. 5. Components of Overall Runtime after Preprocessing the Index Vector over a Short Distance



3.3 Preprocessing the Index Vector 

This optimization aims at reducing the computation complexity of the client by 
encrypting the indices offline in advance and storing the encrypted indices. Even 
if the client does not yet know which indices will be 0 and which will be 1, it 
can simply encrypt a large number of 0’s and a large number of 1’s to use later.
When the client needs to send encrypted indices to the server, it can just retrieve 
the appropriate encryptions. The optimization is useful for mobile devices, e.g. 
PDAs, that have limited computing power but reasonable amounts of storage. 
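
A precomputed pool of encrypted bits might look like the following Python sketch. It assumes a toy Paillier instance (g = n + 1, small Mersenne primes) and a hypothetical `CiphertextPool` class; none of these names come from the implementation described here. Each stored ciphertext is handed out at most once, because sending the same ciphertext twice would let the server see that two positions carry the same bit.

```python
import math
import secrets

# Toy Paillier parameters (illustrative only; real keys are much larger).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)        # needs Python 3.8+


def encrypt(m):
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(c):
    return (pow(c, PHI, N2) - 1) // N * MU % N


class CiphertextPool:
    """Offline store of fresh encryptions of 0 and 1, consumed online."""

    def __init__(self, zeros, ones):
        # The expensive modular exponentiations all happen here, ahead of time.
        self._stock = {
            0: [encrypt(0) for _ in range(zeros)],
            1: [encrypt(1) for _ in range(ones)],
        }

    def take(self, bit):
        """Return an unused ciphertext for `bit`; raises IndexError when empty."""
        return self._stock[bit].pop()

    def remaining(self, bit):
        return len(self._stock[bit])


# Offline phase: fill the pool. Online phase: only list lookups remain.
pool = CiphertextPool(zeros=5, ones=5)
selection = [1, 0, 1, 1, 0]
request = [pool.take(bit) for bit in selection]
decrypted = [decrypt(c) for c in request]
```

The online cost on the client drops to retrieving and sending stored values, which is consistent with the shift of the dominant cost to the server reported for this optimization.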

The results of this optimization are shown in Figure 5, with overall on-line 
execution times reduced to about 3½ minutes for a database of 100,000 elements.
The client’s processing time, now simply to read the stored encryptions and send 
them to the server, is much smaller. All other components remain unchanged; the 
server’s computation time becomes the dominant factor. This experiment was 
conducted on the high performance cluster with a 64Gbps bandwidth switch 
as the communications medium. Hence the delay in communication does not 
assume significant proportions. The reduction in overall runtime is about 82%. 

Figure 6 shows the results observed over a 56Kbps dialup connection with 
the client at Chicago, IL and the server at Hoboken, NJ. In this case, the com- 
munication delay becomes the significant factor. 

3.4 Combination of Optimizations 

The batching of index vector optimization reduces the server’s idle time while 
preprocessing the vector of indices reduces the client’s on-line encryption time. 
Combining these optimizations results in an overall on-line runtime reduction of 
about 94%, as shown in Figure 7. 

3.5 Using Multiple Clients in Parallel 

This alternative aims at reducing the time spent by the client in encrypting the 
index vector by partitioning the task of encryption among multiple clients. The 
challenge is how to protect the privacy of the server while using multiple clients. 

Fig. 6. Components of Overall Runtime after Preprocessing the Index Vector Measured over a Long Distance

Fig. 7. Performance Gain Due to Combination of Optimizations over a Short Distance



In this setting, k clients work in cooperation. Each client is responsible for 
$1/k$th of the database, and will interact with the server to learn a partial sum
corresponding to the chosen indices in that part of the database. However, learn- 
ing these partial sums violates database privacy. Accordingly, the server uses a 
randomized blinding to protect the partial sums; the blinding is removed by the 
clients only after the partial sums are combined into a single sum, as shown in 
Figure 8 for k = 3. 

In phase one, k clients $C_1, C_2, \ldots, C_k$ are involved, each holding an index
vector of $n/k$ elements. (We assume for simplicity that the database size n
is a multiple of k.) The clients independently and in parallel choose their own
encryption keys and interact with the server to learn a blinded encryption of
the appropriate partial sum. That is, the server chooses random numbers $R_1$,
$R_2, \ldots, R_k$ such that $\sum_{i=1}^{k} R_i = 0 \pmod{M}$ (where again M is a parameter of
the encryption scheme). When computing the product to return to client $C_i$,
the server also computes $E(R_i)$ and multiplies it into the product. This has the
effect of adding $R_i$ to the partial sum $P_i$.

Fig. 8. Multiple Clients (k = 3)

Fig. 9. Performance Improvement Due to Secret Sharing with Three Clients (Java implementation)

In phase two, the clients combine their partial sums and remove the blinding 
factor: 

1. Client $C_1$ sends its blinded partial sum to client $C_2$.

2. In turn, each client $C_i$ adds the value received from client $C_{i-1}$ to its own
blinded sum and sends the result to client $C_{i+1}$.

3. Client $C_k$ receives the blinded partial sum from client $C_{k-1}$, adds it to its
blinded partial sum to generate the total unblinded sum, and broadcasts the
result to all the other clients.
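
The arithmetic of the blinding and of the combining phase can be followed in a small Python sketch. It works on plaintext values only and is an assumption-laden illustration: in the protocol the server adds each blinding value homomorphically under the respective client's key, whereas here the addition is done in the clear, and the modulus, the sample data, and the function names are hypothetical.

```python
import secrets


def blinding_values(k, modulus):
    """Server: choose R_1..R_k at random so that their sum is 0 modulo `modulus`."""
    values = [secrets.randbelow(modulus) for _ in range(k - 1)]
    values.append(-sum(values) % modulus)
    return values


def combine(blinded_partials, modulus):
    """Clients: pass a running total along the chain C_1 -> C_2 -> ... -> C_k."""
    running = 0
    for value in blinded_partials:
        running = (running + value) % modulus
    return running      # the blinding cancels only once all k values are added


M = 2**61 - 1           # assumed modulus for the demo
k = 3
database = list(range(1, 13))                        # 12 values, 4 per client
selection = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0]
size = len(database) // k

# Partial sum P_i over the i-th slice of the database.
partials = [
    sum(x * s for x, s in zip(database[i * size:(i + 1) * size],
                              selection[i * size:(i + 1) * size]))
    for i in range(k)
]
blinds = blinding_values(k, M)
blinded = [(p + r) % M for p, r in zip(partials, blinds)]
total = combine(blinded, M)
```

Each individual blinded value is uniformly distributed modulo M and so reveals nothing about its partial sum, yet the chain total equals the true selected sum as long as that sum is smaller than M.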

Figure 9 shows performance results for k = 3. The overall
execution time is reduced by a factor of approximately 2.99, which represents
a 3-fold improvement, minus a small overhead for the combining phase. Note
that we implemented multiple clients only for our Java implementation, so these
performance numbers are significantly higher than those in earlier graphs. They
are shown only to indicate the close to 3-fold improvement. The use of k clients
would result in approximately a k-fold reduction in execution time.

4 Conclusions 

We have analyzed and implemented an instance of selective private function 
evaluation that privately computes the sum of a subset of numbers held by a 
remote database, where the selection of the subset is done by the client. The 
database does not learn anything about which values the client’s computation 
involves, and the client does not learn anything about the values in the database 
other than what is implied by the value of the given sum. 

Our experimental results show that the running time needed is quite high, 
though perhaps feasible in some settings where privacy is considered sufficiently 
important. In a direct implementation, overall running times are around 20 min- 
utes for a database of 100,000 elements in a high-speed communication environ- 
ment. With straightforward optimizations, the running times are only a few 
minutes, and may be within the realm of practice. Unless practical optimiza- 
tions or specialized hardware are used to accelerate encryptions, computation 
delay is the major bottleneck of performance of our implementation. 

It remains open to improve the execution times to scale efficiently to realistically
sized databases. As directions for future work, we plan to investigate the
use of special-purpose cryptographic hardware, as well as methods that give up
some quantifiable amount of privacy in order to achieve significant performance
improvements.

Abstract. In this paper, we address the problem of protecting the underlying
attribute values when sharing data for clustering. The challenge
is how to meet privacy requirements and guarantee valid clustering re- 
sults as well. To achieve this dual goal, we propose a novel spatial data 
transformation method called Rotation-Based Transformation (RBT). 
The major features of our data transformation are: a) it is independent 
of any clustering algorithm; b) it has a sound mathematical foundation;
c) it is efficient and accurate; and d) it does not rely on intractability 
hypotheses from algebra and does not require CPU-intensive operations. 
We show analytically that although the data are transformed to achieve 
privacy, we can also get accurate clustering results by the safeguard of 
the global distances between data points. 



1 Introduction 

Achieving privacy preservation when sharing data for clustering is a challeng- 
ing problem. To address this problem, data owners must not only meet privacy 
requirements but also guarantee valid clustering results. The fundamental ques- 
tion addressed in this paper is: how can organizations protect personal data 
subjected to clustering and meet their needs to support decision making or to 
promote social benefits? 

Clearly, sharing data for clustering poses new challenges for novel uses of 
data mining technology. Let us consider two real-life motivating examples where 
the sharing of data for clustering poses different constraints. 

— Suppose that a hospital shares some data for research purposes (e.g. group 
patients who have a similar disease). The hospital’s security administrator 
may suppress some identifiers (e.g. name, address, phone number, etc) from 
patient records to meet privacy requirements. However, the released data 
may not be fully protected. A patient record may contain other information 
that can be linked with other datasets to re-identify individuals or entities 
[11]. How can we identify groups of patients with a similar disease without 
revealing the values of the attributes associated with them? 



W. Jonker and M. Petkovic (Eds.): SDM 2004, LNCS 3178, pp. 67-82, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

— Two organizations, an Internet marketing company and an on-line retail
company, have datasets with different attributes for a common set of indi- 
viduals. These organizations decide to share their data for clustering to find 
the optimal customer targets so as to maximize return on investments. How 
can these organizations learn about their clusters using each other’s data 
without learning anything about the attribute values of each other? 

Note that the above scenarios describe two different problems of privacy- 
preserving clustering (PPC). We refer to the former as PPC over centralized 
data, and the latter as PPC over vertically partitioned data. The problem of 
PPC over vertically and horizontally partitioned data has been addressed in the 
literature [13,7], while the problem of PPC over centralized data has not been 
significantly tackled. In this paper, we focus on PPC over centralized data. 

There is very little literature regarding the problem of PPC over centralized 
data. A notable exception is the work presented in [10]. The key finding of this 
study was that adding noise to data would meet privacy requirements, but may 
compromise the clustering analysis. The main problem is that by distorting the 
data, many data points would move from one cluster to another jeopardizing 
the notion of similarity between points in the global space. Consequently, this 
introduces the problem of misclassification.

One limitation with the above solution is the trade-off between privacy and 
accuracy of the clustering results. We claim that a challenging solution for PPC 
must do better than a trade-off, otherwise the transformed data will be useless. 
A desirable solution for PPC must consider not only privacy safeguards, but also 
accurate clustering results. 

To support our claim, we propose a novel spatial data transformation method 
called Rotation-Based Transformation (RBT). The major features of our data 
transformation are: a) it is independent of any clustering algorithm, which rep- 
resents a significant improvement over our previous work [10]; b) it has a sound 
mathematical foundation; c) it is efficient and accurate since the distances be- 
tween data points are preserved; and d) it does not rely on intractability hy- 
potheses from algebra and does not require CPU-intensive operations. 

This paper is organized as follows. Related work is reviewed in Section 2. 
The basic concepts of data clustering and geometric data transformations are 
discussed in Section 3. In Section 4, we introduce our RBT method. In Section 5,
we discuss and prove some important issues of security and accuracy pertaining
to our method. Finally, Section 6 presents our conclusions.



2 Related Work 

Some effort has been made to address the problem of privacy preservation in
data clustering. The class of solutions has been restricted basically to data
partitioning [13,7] and data distortion [10]. The work in [13] addresses clustering
vertically partitioned data, whereas the work in [7] focuses on clustering
horizontally partitioned data. In a horizontal partition, different objects are
described with the same schema in all partitions, while in a vertical partition
the attributes of the same objects are split across the partitions.

The work in [13] introduces a solution based on secure multi-party computation.
Specifically, the authors proposed a method for k-means clustering when
different sites contain different attributes for a common set of entities. In this 
solution, each site learns the cluster of each entity, but learns nothing about the 
attributes at other sites. This work ensures reasonable privacy while limiting 
communication cost. 

The feasibility of achieving PPC through geometric data transformation was 
studied in [10]. This investigation revealed that geometric data transforma- 
tions, such as translation, scaling, and simple rotation are unfeasible for privacy- 
preserving clustering if we do not consider the normalization of the data before 
transformation. The reason is that the data transformed through these methods 
would change the similarity between data points. As a result, the data shared for 
clustering would be useless. This work also revealed that the distortion meth- 
ods adopted to successfully balance privacy and security in statistical databases 
are limited when the perturbed attributes are considered as a vector in the 
n-dimensional space. Such methods would exacerbate the problem of misclassi- 
fication. A promising direction of the work in [10] was that PPC through data 
transformation should be to some extent possible by isometric transformations, 
i.e., transformations that preserve distances of objects in the process of moving
them in the Euclidean space.

More recently, a new method, based on generative models, was proposed to 
address privacy preserving distributed clustering [7]. In this approach, rather 
than sharing parts of the original data or perturbed data, the parameters of 
suitable generative models are built at each local site. Then such parameters are 
transmitted to a central location. The best representative of all data is a certain 
“mean” model. It was empirically shown that such a model can be approximated 
by generating artificial samples from the underlying distributions using Markov 
Chain Monte Carlo techniques. This approach achieves high quality distributed 
clustering with acceptable privacy loss and low communication cost. 

The work presented here is orthogonal to that presented in [13,7] and
differs in some aspects from the work in [10]. In particular, we build on our 
previous work. First, instead of distorting data for clustering using translations, 
scaling, rotations or even some combinations of these transformations, we distort 
attribute pairs using rotations only to avoid misclassification of data points. 
Second, our transformation presented here advocates the normalization of data 
before transformation. We show that successive rotations on normalized data will 
protect the underlying attribute values and get accurate clustering results. Third, 
we provide an analysis of the complexity of RBT and discuss a relevant feature
of our method, namely its independence of the clustering algorithm, which
represents a significant improvement over the existing solutions in the literature.
In addition, we show that the computational security of RBT does not rely on
formal proof of security. Rather, it is based on the amount of computational
work required to reverse the transformation process.



3 Basic Concepts

In this section, we review the basic concepts that are necessary to understand 
the issues addressed in this paper. 

3.1 Isometric Transformations 

An isometry (also called congruence) is a special class of geometric
transformations [12,4]. The essential characteristic of an isometry is that
distances between objects are preserved in the process of moving them in an
n-dimensional Euclidean space. In other words, distance must be an invariant
property. Formally, an isometric transformation can be defined as follows [4]:

Definition 1 (Isometric Transformation). Let $T$ be a transformation in the
n-dimensional space, i.e., $T : \mathbb{R}^n \rightarrow \mathbb{R}^n$. $T$ is
said to be an isometric transformation if it preserves distances, satisfying the
following constraint: $|T(p) - T(q)| = |p - q|$ for all $p, q \in \mathbb{R}^n$.

Isometries also preserve angles and transform sets of points into congruent
ones. Special cases of isometries include: (1) translations, which shift points a
constant distance in parallel directions; (2) rotations, which have a center $a$
such that $|T(p) - a| = |p - a|$ for all $p$; and (3) reflections, which map all
points to their mirror images in a fixed $(d-1)$-dimensional plane.

In this work, we focus primarily on rotations. For the sake of simplicity, we
describe the basics of such a transformation in a 2D discrete space. In its simplest
form, this transformation is for the rotation of a point about the coordinate axes.
Rotation of a point in a 2D discrete space by an angle $\theta$ is achieved by using
the transformation matrix in Equation (1). The rotation angle $\theta$ is measured
clockwise and this transformation affects the values of the X and Y coordinates.
Thus, the rotation of a point in a 2D discrete space can be seen as a matrix
representation $v' = Rv$, where $R$ is a 2 x 2 rotation matrix, $v$ is the column
vector containing the original coordinates, and $v'$ is a column vector whose
coordinates are the rotated coordinates.

$$R = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \tag{1}$$
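
As an illustration, the following minimal Python sketch applies the clockwise rotation matrix of Equation (1) to two points and checks that their Euclidean distance is unchanged. The function names and the sample points are assumptions made for this example only.

```python
import math

def rotate_clockwise(point, theta_deg):
    """Rotate a 2D point clockwise by theta_deg using the matrix of Equation (1)."""
    t = math.radians(theta_deg)
    x, y = point
    return (math.cos(t) * x + math.sin(t) * y,
            -math.sin(t) * x + math.cos(t) * y)

def euclidean(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two made-up points and an arbitrary rotation angle.
p, q = (1.0, 2.0), (4.0, 6.0)
p_rot, q_rot = rotate_clockwise(p, 30.0), rotate_clockwise(q, 30.0)

# The coordinates change, but the distance between the points does not.
print(p_rot, q_rot)
print(euclidean(p, q), euclidean(p_rot, q_rot))
```

The individual coordinates of both points change, which is what hides the original values, while the distance between them stays the same up to floating-point rounding.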



3.2 Data Matrix 



Objects (e.g. individuals, patterns, events) are usually represented as points 
(vectors) in a multi-dimensional space. Each dimension represents a distinct 
attribute describing the object. Thus, a set of objects is represented as an m x n
matrix D, where there are m rows, one for each object, and n columns, one
for each attribute. This matrix is referred to as a data matrix, represented as
follows:



$$D = \begin{pmatrix} a_{11} & \cdots & a_{1k} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2k} & \cdots & a_{2n} \\ \vdots & & \vdots & & \vdots \\ a_{m1} & \cdots & a_{mk} & \cdots & a_{mn} \end{pmatrix} \tag{2}$$

The attributes in a data matrix are sometimes normalized before being used.
The main reason is that different attributes may be measured on different scales 
(e.g. centimeters and kilograms). For this reason, it is common to standardize 
the data so that all attributes are on the same scale. There are many methods 
for data normalization [6]. We review only two of them in this section: min-max 
normalization and z-score normalization. 

Min-max normalization performs a linear transformation on the original data.
Each attribute is normalized by scaling its values so that they fall within a small
specific range, such as 0.0 to 1.0. Min-max normalization maps a value $v$ of an
attribute $A$ to $v'$ as follows:

$$v' = \frac{v - min_A}{max_A - min_A} \times (new\_max_A - new\_min_A) + new\_min_A \tag{3}$$

where $min_A$ and $max_A$ represent the minimum and maximum values of an
attribute $A$, respectively, while $new\_min_A$ and $new\_max_A$ define the new
range in which the normalized data will fall.

When the actual minimum and maximum of an attribute are unknown, or 
when there are outliers that dominate the min-max normalization, z-score nor- 
malization (also called zero-mean normalization) should be used. In z-score nor- 
malization, the values for an attribute A are normalized based on the mean and 
the standard deviation of A. A value v is mapped to v' as follows: 



$$v' = \frac{v - \bar{A}}{\sigma_A} \tag{4}$$

where $\bar{A}$ and $\sigma_A$ are the mean and the standard deviation of the
attribute $A$, respectively.
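
The two normalizations can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the z-score version assumes the sample standard deviation (N - 1 denominator). The input is the weight column of the sample used in Section 5.1; under that assumption the z-scores agree with the normalized weights reported there to four decimal places.

```python
from statistics import mean, stdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Equation (3): linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Equation (4): subtract the mean and divide by the standard deviation.

    Assumption: the sample standard deviation (N - 1 denominator) is used.
    """
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

weights = [80, 64, 52, 58, 90]  # weight column of the sample in Section 5.1
print([round(v, 4) for v in min_max(weights)])
print([round(v, 4) for v in z_score(weights)])
```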



3.3 Dissimilarity Matrix 

A dissimilarity matrix stores a collection of proximities that are available for all
pairs of objects. This matrix is often represented by an m x m table. In (5),
we can see the dissimilarity matrix $D_M$ corresponding to the data matrix $D$ in
(2), where each element $d(i,j)$ represents the difference or dissimilarity between
objects $i$ and $j$.

$$D_M = \begin{pmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(m,1) & d(m,2) & \cdots & \cdots & 0 \end{pmatrix} \tag{5}$$

In general, d(i,j) is a nonnegative number that is close to zero when the 
objects i and j are very similar to each other, and becomes larger the more they 
differ. 

To calculate the dissimilarity between objects $i$ and $j$ one could use either
the distance measure in Equation (6) or in Equation (7), or others, where
$i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are
n-dimensional data objects.

$$d(i,j) = \left[ \sum_{k=1}^{n} (x_{ik} - x_{jk})^2 \right]^{1/2} \tag{6}$$

$$d(i,j) = \sum_{k=1}^{n} |x_{ik} - x_{jk}| \tag{7}$$



The metric in Equation (6) is the most popular distance measure called 
Euclidean distance, while the metric in Equation (7) is known as Manhattan or 
city block distance. Both Euclidean distance and Manhattan distance satisfy the 
following constraints: 

— $d(i,j) \geq 0$: distance is a nonnegative number.

— $d(i,i) = 0$: the distance of an object to itself is zero.

— $d(i,j) = d(j,i)$: distance is a symmetric function.

— $d(i,j) \leq d(i,k) + d(k,j)$: distance satisfies the triangular inequality.
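
The two distance measures and the dissimilarity matrix can be sketched as follows. The function names and the three sample objects are assumptions made for illustration; the sketch builds the full symmetric m x m matrix, of which (5) shows the lower triangle.

```python
import math

def euclidean(a, b):
    """Equation (6): Euclidean distance between two n-dimensional objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Equation (7): Manhattan (city block) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dissimilarity_matrix(rows, dist=euclidean):
    """Full m x m matrix of pairwise distances between the given objects."""
    return [[dist(r, s) for s in rows] for r in rows]

rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # made-up objects
dm = dissimilarity_matrix(rows)
for line in dm:
    print(line)
print(manhattan(rows[0], rows[1]))
```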



4 The Rotation-Based Transformation Method 

In this Section, we introduce our method Rotation-Based Transformation (RBT). 
This method is designed to protect the underlying attribute values subjected to 
clustering by rotating the values of two attributes at a time. 

4.1 General Assumptions 

Our approach to distorting data points in the n-dimensional Euclidean space relies
on the following assumptions:

— The data matrix D, subjected to clustering, contains only confidential nu- 
merical attributes that must be transformed to protect individual data values 
before clustering. 

— The existence of an object (e.g. ID) may be revealed but it could be also 
anonymized by suppression. However, the values of the attributes associated 
with an object are private and must be protected. 

— The transformation RBT when applied to a database D must preserve the 
distances between the data points. 

We also assume that the raw data is pre-processed as follows: 

— Suppressing Identifiers. Attributes that are not subjected to clustering (e.g.
address, phone, etc.) are suppressed. Again, the existence of a particular
object, say ID, could be revealed depending on the application (e.g. our first
real-life example), but it could be suppressed when data is made public (e.g.
census, social benefits).

— Normalizing Numerical Attributes. Normalization helps prevent attributes
with large ranges (e.g. salary) from outweighing attributes with smaller
ranges (e.g. age). Equations (3) and (4) can be used for normalization.

Fig. 1. Major steps of the data transformation before clustering analysis.



The major steps of the data transformation, before clustering analysis, are 
depicted in Figure 1. In the first step, the raw data is normalized to give all 
the variables an equal weight. Then, the data are distorted by using our RBT 
method. In doing so, the underlying data values would be protected, and miners 
would be able to cluster the transformed data. There is no need for normalizing 
after the transformation process occurs. 



4.2 General Approach 

Now that we have described the assumptions associated with our method, we 
move on to defining a function that distorts the attribute values of a given 
data matrix to preserve privacy of individuals. We refer to such a function as 
rotation-based data perturbation function, defined as follows: 

Definition 2 (Rotation-Based Data Perturbation Function). Let $D_{m \times n}$
be a data matrix, where each of the m rows represents an object, and each object
contains values for each of the n numerical attributes. We define a Rotation-Based
Data Perturbation function $f_r$ as a bijection of n-dimensional space into
itself that transforms $D$ into $D'$ satisfying the following conditions:

— Pairwise-Attribute Distortion: $\forall i, j$ such that $1 \leq i, j \leq n$ and
$i \neq j$, the vector $V = (A_i, A_j)$ is transformed into $V' = (A'_i, A'_j)$ using
the matrix representation $V' = R \times V$, where $A_i, A_j \in D$,
$A'_i, A'_j \in D'$, and $R$ is the transformation matrix for rotation.

— Pairwise-Security Threshold: the transformation of $V$ into $V'$ is performed
based on the Pairwise-Security Threshold $PST(\rho_1, \rho_2)$, such that the
following constraints must hold: $Variance(A_i - A'_i) \geq \rho_1$ and
$Variance(A_j - A'_j) \geq \rho_2$, with $\rho_1 > 0$ and $\rho_2 > 0$.

The first condition of Definition 2 states that the transformation applied to a 
data matrix D distorts a pair of attributes at a time. In case of an odd number of 
attributes in D , the last attribute can be distorted along with any other already 
distorted attribute, as long as the second condition is satisfied. 

The second condition (Pairwise-Security Threshold) is the fundamental re- 
quirement of a data perturbation method. It quantifies the security of a method 
based on how closely the original values of a modified attribute can be estimated. 

Traditionally, the security provided by a perturbation method has been measured
as the variance between the actual and the perturbed values [1,9]. This
measure is given by $Var(X - Y)$, where $X$ represents a single original attribute
and $Y$ the distorted attribute. This measure can be made scale invariant with
respect to the variance of $X$ by expressing security as $Sec = Var(X - Y)/Var(X)$.

In particular, RBT adopts the traditional way to verify the security of a
perturbation method. However, the security offered by RBT is more challenging.
We impose a pairwise-security threshold for every two distorted attributes. The
challenge is how to strategically select an angle $\theta$ for a pair of attributes to be
distorted so that the second condition is satisfied. In Section 4.3, we introduce
the algorithm that strategically computes the value of $\theta$.
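
The traditional, scale-invariant security measure mentioned above can be computed directly. The following sketch is only an illustration: the function name and both attribute vectors are made up, and the sample variance is used (the ratio is the same for the population variance, since the denominators cancel).

```python
from statistics import variance  # sample variance

def security(original, distorted):
    """Scale-invariant security measure Sec = Var(X - Y) / Var(X)."""
    diff = [x - y for x, y in zip(original, distorted)]
    return variance(diff) / variance(original)

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up original attribute X
y = [1.5, 1.0, 3.5, 5.0, 4.0]  # made-up distorted attribute Y
print(security(x, y))
```

A larger value indicates that the distorted values deviate more from the originals relative to the spread of the original attribute.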

Based on the definition of the rotation-based data perturbation function, now 
we define our RBT method as follows: 

Definition 3 (RBT Method). Let $D_{m \times n}$ be a data matrix, where each of
the m rows represents an object, and each object contains values for each of
the n numerical attributes. The Rotation-Based Data Perturbation method of
dimension n is an ordered pair, defined as $RBT = (D, f_r)$, where:

— $D \in \mathbb{R}^{m \times n}$ is a normalized data matrix of objects to be clustered.

— $f_r$ is a rotation-based data transformation function, $f_r : \mathbb{R}^n \rightarrow \mathbb{R}^n$.



4.3 The Algorithm for the RBT Method 

The procedure to distort the attributes of a data matrix has essentially two major
steps, as follows:

Step 1. Selecting the attribute pairs: We select k pairs of attributes $A_i$
and $A_j$ in $D$, where $i \neq j$. If the number of attributes n in $D$ is even, then
k = n/2. Otherwise, k = (n + 1)/2. The pairs are not selected sequentially.
A security administrator could select the pairs of attributes in any order of
his or her choice. If n is odd, the last attribute selected is distorted along with
any other attribute already distorted. We could try all the possible combinations
of attribute pairs to maximize the variance between the original and the
distorted attributes. However, given that we distort normalized attributes, the
variance of any attribute pairs tends to lie in the same range. We illustrate
this idea in our example presented in Section 5.1.

Step 2. Distorting the attribute pairs: The pairs of attributes selected
previously are distorted as follows:

— (a) Computing the distorted attribute pairs as a function of $\theta$: We compute
$V'(A'_i, A'_j) = R \times V(A_i, A_j)$ as a function of $\theta$, where $R$ is the
rotation matrix defined in Equation (1).

— (b) Meeting the pairwise-security threshold: We derive two inequalities
for each attribute pair based on the constraints: $Variance(A_i - A'_i) \geq \rho_1$
and $Variance(A_j - A'_j) \geq \rho_2$, with $\rho_1 > 0$ and $\rho_2 > 0$.

— (c) Choosing the proper value for $\theta$: Based on the inequalities found
previously, we identify a range for $\theta$ that satisfies the pairwise-security
threshold $PST(\rho_1, \rho_2)$. We refer to such a range as the security range. Then,
we randomly select a real number in this range and assign it to $\theta$.

— (d) Outputting the distorted attribute pairs: Given that $\theta$ is already
determined, we now recompute substep (a), i.e., $V'(A'_i, A'_j) = R \times V(A_i, A_j)$,
and output the distorted attribute pairs.

Each inequality in substep (b) is solved by computing the variance of the matrix
subtraction $[A_i - A'_i]$. In [5], it is shown that the sample variance of N values
$x_1, x_2, \ldots, x_N$ is calculated by:

$$Var(x_1, x_2, \ldots, x_N) = \frac{1}{N-1} \times \sum_{i=1}^{N} (x_i - \bar{x})^2 \tag{8}$$

where $\bar{x}$ is the arithmetic mean of the values $x_1, x_2, \ldots, x_N$.
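
To make the dependence on the angle explicit, the two variances of substep (b) can be expanded in closed form. The derivation below is a sketch that assumes the clockwise rotation matrix of Equation (1) applied to the attribute pair; the symbol $r$ for the covariance of the pair is introduced here for illustration.

```latex
% Rotation of the pair (A_i, A_j) with the matrix of Equation (1):
%   A_i' =  \cos\theta \, A_i + \sin\theta \, A_j
%   A_j' = -\sin\theta \, A_i + \cos\theta \, A_j
\begin{align*}
A_i - A_i' &= (1-\cos\theta)\,A_i - \sin\theta\,A_j \\
\mathrm{Var}(A_i - A_i') &= (1-\cos\theta)^2\,\mathrm{Var}(A_i) + \sin^2\theta\,\mathrm{Var}(A_j)
    - 2(1-\cos\theta)\sin\theta\,\mathrm{Cov}(A_i, A_j) \\
A_j - A_j' &= \sin\theta\,A_i + (1-\cos\theta)\,A_j \\
\mathrm{Var}(A_j - A_j') &= \sin^2\theta\,\mathrm{Var}(A_i) + (1-\cos\theta)^2\,\mathrm{Var}(A_j)
    + 2(1-\cos\theta)\sin\theta\,\mathrm{Cov}(A_i, A_j)
\end{align*}
% Special case: both attributes have unit variance and covariance r.
\begin{align*}
\mathrm{Var}(A_i - A_i') &= 2(1-\cos\theta)(1 - r\sin\theta) \\
\mathrm{Var}(A_j - A_j') &= 2(1-\cos\theta)(1 + r\sin\theta)
\end{align*}
```

The unit-variance form applies to a pair of freshly normalized attributes. It no longer applies once one attribute of the pair has already been rotated, because its variance is then generally different from 1.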

The inputs for the RBT algorithm are a normalized data matrix $D$ and a
set of k pairwise-security thresholds $T_k$. We assume that there are k pairs of
attributes to be distorted. The output is the transformed data matrix $D'$ which
is shared for clustering analysis. The sketch of the RBT algorithm is given as
follows:

RBT_Algorithm
Input: $D_{m \times n}$, $T_k$
Output: $D'_{m \times n}$

1. $k \leftarrow \lceil n/2 \rceil$
2. $P_k \leftarrow$ k Pairs($A_i$, $A_j$) in $D$ such that $1 \leq i, j \leq n$ and $i \neq j$
3. For each selected pair $P_k$ in Pairs($D$) do
   3.1 $V'(A'_i, A'_j) \leftarrow R_\theta \times V(A_i, A_j)$   // V' is computed as a function of $\theta$
   3.2 Compute($Var(A_i - A'_i) \geq \rho_1$, $Var(A_j - A'_j) \geq \rho_2$)
   3.3 $\theta_k \leftarrow$ SecurityRange($Var(A_i - A'_i) \geq \rho_1$, $Var(A_j - A'_j) \geq \rho_2$)
   3.4 $V'(A'_i, A'_j) \leftarrow R_{\theta_k} \times V(A_i, A_j)$   // Output the distorted attributes of $D'$
   End_for
End_Algorithm
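
The algorithm can be prototyped in plain Python. The sketch below is one possible reading, not the authors' implementation: all function names are hypothetical, the security range is approximated by scanning a grid of angles instead of being read from a plot, and the variance uses an N - 1 denominator. The data are the normalized columns of the sample in Section 5.1; the thresholds of the second pair are illustrative low values, chosen so that the grid is guaranteed to contain an admissible angle whatever angle is drawn for the first pair.

```python
import math
import random

def sample_variance(xs):
    """Sample variance with an N - 1 denominator (assumed form of Equation (8))."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def rotate_pair(a_i, a_j, theta_deg):
    """Lines 3.1 and 3.4: rotate the attribute pair clockwise by theta_deg."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    new_i = [c * x + s * y for x, y in zip(a_i, a_j)]
    new_j = [-s * x + c * y for x, y in zip(a_i, a_j)]
    return new_i, new_j

def meets_threshold(a_i, a_j, theta_deg, rho1, rho2):
    """Line 3.2: test both variance constraints for one candidate angle."""
    r_i, r_j = rotate_pair(a_i, a_j, theta_deg)
    var_i = sample_variance([x - y for x, y in zip(a_i, r_i)])
    var_j = sample_variance([x - y for x, y in zip(a_j, r_j)])
    return var_i >= rho1 and var_j >= rho2

def security_angles(a_i, a_j, rho1, rho2, step=0.1):
    """Line 3.3: approximate the security range on a grid of angles in [0, 360)."""
    count = int(round(360 / step))
    return [k * step for k in range(count)
            if meets_threshold(a_i, a_j, k * step, rho1, rho2)]

def rbt(columns, pairs, thresholds, rng=random):
    """Distort a normalized data matrix, given as a list of attribute columns."""
    out = [list(col) for col in columns]
    for (i, j), (rho1, rho2) in zip(pairs, thresholds):
        candidates = security_angles(out[i], out[j], rho1, rho2)
        if not candidates:
            raise ValueError("empty security range for pair (%d, %d)" % (i, j))
        theta = rng.choice(candidates)  # random angle inside the security range
        out[i], out[j] = rotate_pair(out[i], out[j], theta)
    return out

# Normalized age, weight and heart_rate columns of the sample in Section 5.1.
age = [1.4809, 0.4151, -0.4824, -1.1556, -0.2580]
weight = [0.7095, -0.3041, -1.0642, -0.6841, 1.3430]
heart_rate = [-0.3476, -1.5061, 0.4634, 1.1586, 0.2317]

distorted = rbt([age, weight, heart_rate],
                pairs=[(0, 2), (1, 0)],                    # [age, heart_rate], [weight, age]
                thresholds=[(0.30, 0.55), (0.30, 0.30)],   # second pair: illustrative values
                rng=random.Random(0))
for column in distorted:
    print([round(v, 4) for v in column])
```

Because the second pair reuses an attribute that was already rotated, its security range depends on the angle drawn for the first pair. The sketch therefore raises an error when a threshold leaves no admissible angle on the grid, instead of silently returning undistorted data.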



Theorem 1. The running time of the RBT_Algorithm is O(m x n), where m is
the number of objects and n is the number of attributes in a data matrix D.

Proof. Let D be a data matrix composed of m rows (objects) and n numerical 
attributes, and k the number of attribute pairs in D to be distorted. 

Line 1 is a straightforward computation that takes O(1). In line 2, the algorithm
does not select all the possible combinations of pairs. The selection of the
attribute pairs is performed by simply grouping the attributes in pairs but not
sequentially. In general, this computation takes n/2 when n is even and (n+1)/2
when n is odd. Thus, the running time for Step 1 (lines 1 and 2) is O(n).

The matrix product in line 3.1 takes 2 x 2 x m. When m is large, line 3.1
takes O(m). Line 3.2 encompasses two vector subtractions, each one taking m x 1,
resulting in 2 x m iterations. After computing the vector subtractions, we
compute the variance of these vectors. We scan both vectors once to compute
their mean since they have the same order. Then we scan these vectors again to
compute their variance. Each scan takes m x 1. Thus, line 3.2 takes 2 x m + 2 x m.
Therefore, the running time of line 3.2 is O(m). Line 3.3 is a straightforward
computation that takes O(1) since one value for $\theta$ is selected randomly. Line 3.4
is similar to line 3.1 and takes O(m). Recall that the whole loop is performed
at most n times. Thus, the running time for line 3 is O(n x (m + m + 1 + m)),
which can be simplified to O(n x m).

The running time of the RBT_Algorithm is the sum of the running times for each
step, i.e., O(n + n x m). When m is large, n x m grows faster than n. Thus, the
running time of the RBT_Algorithm is O(m x n). □



5 RBT Method: Accuracy Versus Security 

In this section, we analyze some issues of accuracy, security, and privacy
pertaining to the RBT method.

5.1 RBT Method: Accuracy 

We illustrate the accuracy of the RBT method through one example. Then we 
show analytically that the accuracy of our method is independent of the database 
size. 

Let us consider the sample relational database in Table 1 and the correspond- 
ing normalized database in Table 2, using Equation (4). This sample contains 
real data of the Cardiac Arrhythmia Database available at the UCI Repository 
of Machine Learning Databases [2]. We purposely selected only three numerical
attributes of this database: age, weight, and heart_rate (number of heart beats
per minute).



Table 1. A sample of the cardiac arrhythmia database.

ID      age    weight    heart_rate
1237    75     80        63
3420    56     64        53
2543    40     52        70
4461    28     58        76
2863    44     90        68



Table 2. The corresponding normalized database.

ID      age        weight     heart_rate
1237     1.4809     0.7095    -0.3476
3420     0.4151    -0.3041    -1.5061
2543    -0.4824    -1.0642     0.4634
4461    -1.1556    -0.6841     1.1586
2863    -0.2580     1.3430     0.2317



First, we select the pairs of attributes to distort. Let us assume that the
pairs selected are: pair1 = [age, heart_rate] and pair2 = [weight, age]. Then,
we set a pairwise-security threshold for each pair of attributes selected:
$PST_1 = (0.30, 0.55)$ and $PST_2 = (2.30, 2.30)$.

After setting the pairwise-security thresholds, we start the transformation
process for the first attribute pair by computing
$V'(age', heart\_rate') = R \times V(age, heart\_rate)$:

$$\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \times \begin{pmatrix} 1.4809 & 0.4151 & -0.4824 & -1.1556 & -0.2580 \\ -0.3476 & -1.5061 & 0.4634 & 1.1586 & 0.2317 \end{pmatrix} \tag{9}$$

Note that the vector $V'(age', heart\_rate')$ is computed as a function of $\theta$.
Therefore, the following constraints are functions of $\theta$ as well.

— $Variance(age - age') \geq 0.30$

— $Variance(heart\_rate - heart\_rate') \geq 0.55$

Recall that the values for age and heart_rate are available in the normalized
data matrix in Table 2. Our goal is to find the proper angle $\theta$ to rotate
the attributes age and heart_rate satisfying the above constraints. The
rotated attributes are age' and heart_rate'. To accomplish that, we plot the
above inequalities and identify the security range, as can be seen in Figure 2.
In this figure, there are two lines representing the pairwise-security threshold
$PST_1 = (0.30, 0.55)$. We identify the security range for $\theta$ that satisfies
both thresholds at the same time. As can be seen, this interval ranges from
48.03 to 314.97 degrees. Then we randomly choose one angle $\theta$ in this interval,
say $\theta = 312.47$. For this choice, $Variance(age - age') = 0.318$
and $Variance(heart\_rate - heart\_rate') = 0.9805$, which satisfy the
pairwise-security threshold $PST_1 = (0.30, 0.55)$.
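
The two variances for the chosen angle can be recomputed from the normalized values in Table 2. The sketch below assumes the clockwise rotation of Equation (1) and a sample variance with an N - 1 denominator; the function names are hypothetical. Under these assumptions the results agree with the reported values to about three decimal places, the small differences being due to the rounding of the table entries.

```python
import math

age = [1.4809, 0.4151, -0.4824, -1.1556, -0.2580]        # Table 2
heart_rate = [-0.3476, -1.5061, 0.4634, 1.1586, 0.2317]  # Table 2

def sample_variance(xs):
    """Sample variance with an N - 1 denominator."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def difference_variances(a_i, a_j, theta_deg):
    """Return (Var(a_i - a_i'), Var(a_j - a_j')) for a clockwise rotation."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    d_i = [x - (c * x + s * y) for x, y in zip(a_i, a_j)]
    d_j = [y - (-s * x + c * y) for x, y in zip(a_i, a_j)]
    return sample_variance(d_i), sample_variance(d_j)

var_age, var_hr = difference_variances(age, heart_rate, 312.47)
print(var_age, var_hr)
print(var_age >= 0.30 and var_hr >= 0.55)  # pairwise-security threshold PST_1
```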




Fig. 2. The security range for $Var(age - age')$ and $Var(heart\_rate - heart\_rate')$, plotted against the angle $\theta$.



After distorting the attributes age and heart_rate, we now repeat the steps
performed previously to distort the attributes weight and age. We combine weight
with age because we need exactly two attributes to be distorted at a time. We
could combine weight with heart_rate as well. The values of the attribute age
have been distorted in the previous steps.

We plot the inequalities and identify the security range, as can be seen in
Figure 3. This interval ranges from 118.74 to 258.70 degrees. Then we randomly
choose one angle $\theta$ in this interval, say $\theta = 147.29$. For this choice,
$Variance(weight - weight') = 2.9714$ and $Variance(age - age') = 6.9274$,
which satisfy the pairwise-security threshold $PST_2 = (2.30, 2.30)$.




Fig. 3. The security range for $Var(weight - weight')$ and $Var(age - age')$.



The cardiac arrhythmia database after transformation is shown in Table 3,
while Table 4 shows the dissimilarity matrix corresponding to Table 3.



Table 3. The cardiac arrhythmia database after transformation.

ID      age        weight     heart_rate
1237    -1.4405     0.0819     0.8577
3420    -1.0063     1.0077    -0.7108
2543     1.1368     0.5347    -0.0429
4461     1.7453    -0.3078    -0.0701
2863    -0.4353    -1.3165    -0.0339



Table 4. The dissimilarity matrix corresponding to Table 3.

0
1.8723    0
2.7674    2.2940    0
3.3409    3.1164    1.0396    0
1.9393    2.4872    2.4287    2.4029    0



Here we highlight an interesting outcome yielded by our method: the dissimilarity
matrix corresponding to the normalized database in Table 2 is exactly
the dissimilarity matrix in Table 4. This result suggests that the RBT method is
an isometry in the n-dimensional space, independent of the size of the database
to be transformed:

Theorem 2. The RBT method is an isometric transformation in the n-dimensional
space.

Proof. By using the concept of distance between objects.

Let $D_{m \times n}$ be a data matrix where m is the number of objects and n is the
number of attributes. Without loss of generality, the rotation of any two
attributes $A_i$ and $A_j$ in $D$, where $i \neq j$, will maintain the distance
between the m objects invariant. The preservation of such distances is assured
because rotations are isometric transformations [4,8]. Applying the RBT method
to $D$ will result in a transformed data matrix $D'$ where all the attributes in
$D'$ are transformed by successive rotations of an attribute pair at a time.
Because a composition of isometries is itself an isometry, the RBT method is
an isometric transformation in the n-dimensional space. □
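
The argument of the proof can also be written out in matrix form. The following lines are a sketch: the symbol $G$ for the n x n matrix that rotates two coordinates and leaves the others unchanged, and the indices $G_1, \ldots, G_k$ for the successive rotations, are notation introduced here for illustration.

```latex
% G acts as R (Equation (1)) on coordinates i and j and as the identity elsewhere.
\begin{align*}
R^{\top} R &= \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
              \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
            = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
            \;\Rightarrow\; G^{\top} G = I_n \\
\|Gp - Gq\|^2 &= (p-q)^{\top} G^{\top} G \,(p-q) = \|p-q\|^2 \\
(G_k \cdots G_1)^{\top} (G_k \cdots G_1) &= G_1^{\top} \cdots G_k^{\top} G_k \cdots G_1 = I_n
\end{align*}
```

The last line shows that k successive pairwise rotations form an orthogonal transformation, so every pairwise Euclidean distance is preserved.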

A natural consequence of Theorem 2 is that our transformation method is 
independent of the clustering algorithm. After applying the RBT method to a 
data matrix D , the clusters mined from the released data matrix D' will be 
exactly the same as those mined in D , given the same clustering algorithm: 

Corollary 1. Given a data matrix D and a data matrix D' obtained from D by using
the RBT method, the clusters mined from D and D' are exactly the same for
any clustering algorithm.

Proof. By using the concept of dissimilarity matrix. 

From Theorem 2 we know that the distances between the objects in a data matrix
D are exactly the same as the distances between the corresponding objects in
the transformed data matrix D'. Hence, applying any distance-based clustering
algorithm to D and D' will result in the same clusters. □
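
Theorem 2 and Corollary 1 can be checked numerically on the running example. The sketch below applies the two rotations of Section 5.1 (312.47 degrees for [age, heart_rate], then 147.29 degrees for [weight, age]) to the values of Table 2 and compares the dissimilarity matrices before and after. It assumes the clockwise rotation of Equation (1); the helper names are hypothetical.

```python
import math

def rotate_pair(a_i, a_j, theta_deg):
    """Rotate the attribute pair (a_i, a_j) clockwise by theta_deg, as in Equation (1)."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    return ([c * x + s * y for x, y in zip(a_i, a_j)],
            [-s * x + c * y for x, y in zip(a_i, a_j)])

def lower_triangle(rows):
    """Lower triangle of the Euclidean dissimilarity matrix."""
    return [[math.dist(rows[a], rows[b]) for b in range(a)] for a in range(len(rows))]

# Normalized attributes from Table 2.
age = [1.4809, 0.4151, -0.4824, -1.1556, -0.2580]
weight = [0.7095, -0.3041, -1.0642, -0.6841, 1.3430]
heart_rate = [-0.3476, -1.5061, 0.4634, 1.1586, 0.2317]

age1, hr1 = rotate_pair(age, heart_rate, 312.47)   # pair1 = [age, heart_rate]
weight2, age2 = rotate_pair(weight, age1, 147.29)  # pair2 = [weight, age]

original = list(zip(age, weight, heart_rate))
released = list(zip(age2, weight2, hr1))

before, after = lower_triangle(original), lower_triangle(released)
max_gap = max(abs(x - y) for r1, r2 in zip(before, after) for x, y in zip(r1, r2))

print([round(v, 4) for v in released[0]])  # first transformed object
print(max_gap)                             # largest change in any pairwise distance
```

The transformed rows agree with Table 3 up to rounding, and the largest change in any pairwise distance is at the level of floating-point error.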

5.2 RBT Method: Computational Security 

Unlike methods in cryptography that require formal proof of security, the
computational security of RBT is based on the amount of computational work
required to reverse the transformation process. A brute force attack would require
a great deal of computational power to get the original data.

In general, the computational security of RBT is a function which depends 
on the following factors: 

— The selection of attribute pairs: the combination of the attribute pairs is ex- 
tremely important since each attribute pair will lead to a particular security 
range. 

— The order of attribute pairs: the order of an attribute in a pair gives the 
direction of the vectors representing data objects in the n-dimensional space. 

— The selection of pairwise- security thresholds: the lower the pairwise-security 
threshold selected by a security administrator the broader the security range. 

— The selection of the angle $\theta$: the angle $\theta$ for each attribute pair is
selected randomly in a continuous interval (the security range).

In our previous example, the security range for the attribute pairs would
be completely different if we had selected the pairs as follows: pair1 = [weight,
heart_rate] and pair2 = [heart_rate, age]. In addition, the order of the attributes
in an attribute pair will indicate the direction of the rotation in the space.
Clearly, the computational difficulty increases progressively as the number
of attributes in a database increases. Apart from that, it is not trivial for an
attacker to guess the angle $\theta$ for a particular attribute pair since the security
range is a continuous interval. Note that the angles selected in our previous
example are real numbers.

Table 5. The dissimilarity matrix corresponding to Table 3 after normalization.

0
3.0121    0
2.5196    2.0314    0
2.8778    2.7384    1.0499    0
2.3604    2.9205    2.3811    1.9492    0

Table 6. A copy of the dissimilarity matrix corresponding to Table 3 without
normalization.

0
1.8723    0
2.7674    2.2940    0
3.3409    3.1164    1.0396    0
1.9393    2.4872    2.4287    2.4029    0

Based on the four factors above, RBT can be seen as a technique on the bor- 
der with obfuscation. Obfuscation techniques aim at making information highly 
illegible without actually changing its inner meaning [3]. In other words, using 
RBT the original data is transformed so that the transformed data captures 
all the information for clustering analysis while protecting the underlying data 
values. 

Now we show the security of our method against attacks. We know that the
variances of the attributes in a database are equal to 1 after normalization, using
Equation (4). For instance, the variances of the attributes in Table 2 are [1.000;
1.000; 1.000]. In contrast, the variances of the distorted database in Table 3
are [1.9039; 0.7840; 0.3122]. Note that although the variances of the attributes in
Table 2 and Table 3 are different, we know that their dissimilarity matrices are
exactly the same, as shown in Section 5.1. Even if an attacker who has access
to the perturbed data also has access to the variances of the original (normalized)
data, this attacker cannot reverse the transformation process. The reason is
that the variances of the original (normalized) data and the variances of the
distorted data are completely different. On the other hand, if this attacker tries to
normalize the data in Table 3 in an attempt to reverse the transformation process,
the distances between the objects will be changed, as can be seen in the
dissimilarity matrix in Table 5. In this case, the data normalized after the
distortion process would be useless and the attempt to reverse the transformation
process would be frustrated.
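
The effect of normalizing the released data can be reproduced from Table 3. The sketch below re-applies z-score normalization to the transformed attributes, assuming the sample standard deviation, and compares the distance between the first two objects before and after; the function name is hypothetical.

```python
import math
from statistics import mean, stdev

# Transformed objects from Table 3 (age, weight, heart_rate).
table3 = [(-1.4405, 0.0819, 0.8577),
          (-1.0063, 1.0077, -0.7108),
          (1.1368, 0.5347, -0.0429),
          (1.7453, -0.3078, -0.0701),
          (-0.4353, -1.3165, -0.0339)]

def z_score_columns(rows):
    """Apply z-score normalization (Equation (4)) to every attribute column."""
    normalized = []
    for column in zip(*rows):
        mu, sigma = mean(column), stdev(column)  # sample standard deviation
        normalized.append([(v - mu) / sigma for v in column])
    return list(zip(*normalized))

renormalized = z_score_columns(table3)
before = math.dist(table3[0], table3[1])
after = math.dist(renormalized[0], renormalized[1])
print(round(before, 4), round(after, 4))
```

The first value corresponds to the entry d(2,1) of Table 4 and the second to the same entry of Table 5, up to rounding of the table values. The distance is no longer preserved once the distorted data is normalized again.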

5.3 RBT Method: The Privacy Preservation Process 

The process of protecting privacy of objects through the RBT method is
accomplished in three major steps as follows:

Step 1: Data Obscuring. First, we try to obscure the raw data by normalization.
Clearly, normalization is not secure at all, even though it is one way to
obfuscate attribute values subjected to clustering. On the other hand, data
normalization brings two important benefits to PPC: a) it gives an equal
weight to all attributes; and most importantly b) it makes the re-identification
of objects with other datasets difficult, since in general public data are
not normalized.

Step 2: Data Anonymization. We could also anonymize the released data- 
base by removing identifiers from the distorted data. For example, the at- 
tribute ID in Table 3 could be suppressed from the data. In doing so, the 
privacy of individuals would be enhanced. 

Step 3: Data Distortion. Disguising the data by normalization and by anony- 
mization is not enough. So we distort attribute values by rotating two at- 
tributes at a time. Note that RBT follows the security requirements of tradi- 
tional methods for data distortion. The fundamental basis of such methods 
is that the security provided after data perturbation is measured as the vari- 
ance between the actual and the perturbed values. RBT is more flexible than 
the traditional methods in the sense that a security administrator can impose 
a security threshold for each attribute pair before the distortion process. 
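A minimal sketch of how such a threshold could be enforced is given below. It assumes that the security of an attribute is measured as the variance of the difference between its original and rotated values, and that the administrator supplies one threshold per attribute of the pair; the function names, the one-degree search grid and the sample data are hypothetical.

```python
import math

def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def rotate_pair(xs, ys, theta_deg):
    """Rotate every point (x, y) clockwise by theta_deg degrees."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    return ([c * x + s * y for x, y in zip(xs, ys)],
            [-s * x + c * y for x, y in zip(xs, ys)])

def admissible_angles(xs, ys, rho_x, rho_y, step_deg=1):
    """Angles for which both attributes are distorted at least as required."""
    angles = []
    for theta in range(step_deg, 360, step_deg):
        xr, yr = rotate_pair(xs, ys, theta)
        security_x = variance([a - b for a, b in zip(xs, xr)])
        security_y = variance([a - b for a, b in zip(ys, yr)])
        if security_x >= rho_x and security_y >= rho_y:
            angles.append(theta)
    return angles

# Hypothetical, already normalized attribute pair.
x = [-1.2, -0.4, 0.3, 1.3]
y = [-0.9, 0.6, -0.7, 1.0]
angles = admissible_angles(x, y, rho_x=0.5, rho_y=0.5)
print(angles[0], angles[-1], len(angles))
```

Very small angles barely change the values and are therefore rejected, while the administrator remains free to pick any angle from the admissible set for each attribute pair.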

6 Conclusions 

In this paper, we have introduced a novel spatial data transformation method for 
Privacy-Preserving Clustering, called Rotation-Based Transformation (RBT). 
Our method was designed to protect the underlying attribute values subjected 
to clustering without jeopardizing the similarity between data objects under 
analysis. By releasing a database transformed by RBT, a database owner meets
privacy requirements and guarantees valid clustering results. The data shared
after the transformation to preserve privacy do not need to be normalized again.

RBT can be seen as a technique bordering on obfuscation, since the
transformation process makes the original data difficult to perceive or
understand while preserving all the information needed for clustering analysis.

The highlights of our method are as follows: a) it is independent of any
clustering algorithm, which represents a significant improvement over the
existing methods in the literature; b) it has a sound mathematical foundation;
c) it is efficient and accurate, and provides a security safeguard to protect
the privacy of individuals; and d) it does not rely on intractability
hypotheses from algebra and does not require CPU-intensive operations.

Abstract. DRM systems provide a means for protecting digital content, but at
the same time they violate the privacy of users in a number of ways. This paper 
addresses privacy issues in DRM systems. The main challenge is how to allow a 
user to interact with the system in an anonymous/pseudonymous way, while 
preserving all security requirements of usual DRM systems. To achieve this 
goal, the paper proposes a set of protocols and methods for managing user iden- 
tities and interactions with the system during the process of acquiring and con- 
suming digital content. Furthermore, a method that supports anonymous transfer 
of licenses is discussed. It allows a user to transfer a piece of content to another 
user without the content provider being able to link the two users. Finally, the 
paper demonstrates how to extend the rights of a given user to a group of users 
in a privacy preserving way. The extension hides the group structure from the 
content provider and at the same time provides privacy among the members of 
the group. 



1 Introduction 

Recent developments in digital technologies, along with increasingly interconnected
high-speed networks and falling prices for high-performance digital devices,
have established digital content distribution as one of the fastest-emerging
activities today and have made possible new ways for consumers to access, use,
enjoy, and pay for digital content. As a consequence of this trend and of the big
success of one of the first online music shops, Apple’s iTunes, which sold more than
70 million songs in its first year [1], a number of shops have opened, and both
consumers and content providers have clearly shown high interest in the electronic
distribution of audio/video content.

However, digital content can very easily be illegally copied, exchanged, and
distributed, which the content industry sees as a big threat. Content providers
therefore need a technology that can protect digital content from illegal use. Digital
Rights Management (DRM) is a technology that provides content protection by enforcing
the use of digital content according to granted rights. It enables content providers to
protect their copyrights and maintain control over distribution of and access to
content. To fulfill the needs of content providers, a number of DRM systems have
quickly appeared, such as Microsoft Windows Media DRM [2], IBM’s Electronic Music
Management System (EMMS) [3], Sony’s Open MagicGate [4], and Thomson’s SmartRight
[5]. Early DRM systems were device-based, which means that they bound content and
rights to devices so that content is accessible only on a specific device. However, in
order to allow a consumer to access his content anytime, anywhere, on any device, the
idea of person-based DRM has emerged. Furthermore, some DRM systems, such as the
Authorized Domain Digital Rights Management (ADDRM) system from Philips [6] [7], take
into account the requirements of content consumers along with those of content
owners. Philips’ ADDRM allows content to flow freely inside a domain (typically a
household), so that it can be freely copied inside that domain and exchanged among the
domain devices, while transactions between different domains are controlled.

A typical DRM system normally provides means for protecting content, creating 
and enforcing rights, identification of users, monitoring of the usage of content, and so 
on. Therefore, these systems are very privacy-invasive. They violate users’ privacy in 
a number of ways. Firstly, they do not support anonymous and un-linkable buying or 
transfer of content as in the traditional business model where a user anonymously buys 
a CD using cash. Furthermore, they generally involve tracking of the usage of content 
in order to keep control over the content [8]. For example, in person-based DRM sys- 
tems a user has to authenticate himself each time he accesses a piece of content. 
Therefore, information such as user identification, content identification, time, place, 
etc. might be collected. The same holds for device-based DRM systems, except that
user identification might not be revealed in such a straightforward way, although this
information can be derived from other data that may link a unique device
identification or content identification with the user identification.

In an increasingly privacy-aware world, such possibilities of creating user profiles 
or tracking users create numerous privacy concerns. In order to overcome the afore- 
mentioned privacy problems with DRM systems, this paper proposes a Privacy- 
Preserving DRM system (P2DRM). The main idea is to allow a user to interact with 
the system in an anonymous/pseudonymous way during the whole process of buying 
and consuming digital content. This has to be done in a way that all security require- 
ments of the usual DRM systems are satisfied, and that content providers are assured 
that content will be used according to issued licenses and cannot be illegally copied. 
Furthermore, the paper discusses an approach to an anonymous transfer of licenses, so 
that a piece of content can be sold or gifted to another user without the content pro- 
vider being able to link the two users. Finally, the paper demonstrates how the basic 
system can be extended to achieve authorized domain functionality in a privacy pre- 
serving way. This extension hides the domain structure from the content provider and 
at the same time provides privacy among the members of the domain. 

The remainder of the paper is organized as follows. In Section 2, the basic system
is introduced. Section 3 discusses a solution that extends the basic system to support
an anonymous transfer of licenses. In Section 4, a description of an additional
extension of the system that allows privacy preserving creation of a domain and its
functioning is given. Finally, Section 5 draws conclusions.

2 Basic System

In the basic privacy-preserving DRM system, the real identity of the user is decoupled 
from identifiers which the user possesses in the system. These identifiers, i.e. user 
pseudonyms, in the P2DRM system are used to link a user (or his ID device, e.g. a 
smart card) to content, thus allowing a user to access the content for which he bought 
the rights. Moreover, the identifiers may also be used to keep track of the behaviour of
that user ID device, thus preventing known hacked user ID devices from continuing to
be used in the system.

There are a number of entities which are present in the P2DRM system. They com- 
prise: 

• User, 

• Smart card (SC), the user ID device, 

• Smart card issuer (SCI), 

• Compliance certificate issuer for smart cards (CA-SC), 

• Content provider (CP), 

• Compliant device (CoD), a device that behaves according to the DRM rules, 

• Compliance certificate issuer for compliant devices (CA-CoD). 

There are also a number of threats, which are mentioned below, relating to the se- 
curity of the system and the privacy of the users of this system. These threats are han- 
dled by the P2DRM system by means of schemes which are discussed in the sections 
below. 

The basic privacy threat that P2DRM circumvents is the association of a user’s real
identity with the content that the user owns, an association which may arise from the
use of personal licenses for content access. Preventing it also prevents users from
being tracked while accessing the content.

General security threats for the DRM system include the possibility of hacking
smart cards as well as the devices on which content is accessed. These threats are
avoided in the P2DRM system by means of compulsory mutual compliance checks
between smart cards and devices. These checks, in turn, may violate users’ privacy,
which is circumvented with the use of temporary user pseudonyms.

When the transfer of licenses amongst users is made possible, security and privacy 
threats are also present. Security threats relate to the fact that users may be able to 
continue using their licenses after they have transferred those licenses to other users. 
Privacy threats relate to the possible disclosure of the association between the user 
who transfers and the user who receives a given license. These threats are avoided in 
the P2DRM system by means of revocation lists and generic (or anonymous) licenses 
issued by the CP. 

Finally, concerning the composition of a group (or domain) of users who are al- 
lowed to share licenses (e.g., users in a household), the basic privacy threat is the 
disclosure of the domain structure. P2DRM avoids this threat by means of a trusted 
domain manager device. 

Within the P2DRM system, a number of different transactions, schematically
depicted in Figure 1, are performed by/with the entities listed above. These
transactions are described in the sections below, where references to the numbered
links in Figure 1 are made at the appropriate points. Moreover, several security
assumptions are made, which are indicated for each phase of usage of the DRM system.




Fig. 1. Representation of the various interactions between parties in the P2DRM system. 



2.1 Acquisition of a Smart Card by the User 

The user buys from a retailer a smart card that is taken from a pool of identically
“looking” smart cards pre-issued by the SCI. Each smart card contains a different secret
public/private key pair PK/SK and an un-set PIN (say, all PINs are initially set to
0000). The SCI guarantees that until anyone interacts with the card for the first time, 
the public key of that specific card is not revealed to any party, nor is a PIN (used to 
activate the card) set. So, in this way, the user (as the first interacting party) is the only 
entity which can learn the public key (and therefore know the association between the 
real user identity and PK) and which can set the PIN used to activate the card. Note 
that the private key SK is securely stored on the smart card and it is not accessible to 
the user. 

Security assumptions in this context are: 

• The public key PK of an SC is revealed, and its PIN is set, only upon the first
transaction.

• The private key corresponding to public key PK is stored secretly and only 
known to the SC. 

2.2 Acquisition of the Content and the Rights by the User 

When the user wants to buy the rights to access some content, he contacts the CP by
means of an anonymous channel, requesting the rights to a given piece of content. After
an anonymous payment scheme is conducted (such as the pre-payment scheme described
in [9]), the user sends his public key PK to the CP (link 1 in Figure 1), which can then
create the right or license for that content. The content itself is encrypted by the CP
with a symmetric key Sym and sent to the user together with the license (link 2 in
Figure 1), whose format is given in (1). The channel must also be secret, to prevent
an eavesdropper from associating the public key PK with the license sent.

$$\{\, PK[\,Sym // Rights // contentID\,] \,\}_{signCP} \qquad (1)$$

In the license above, PK encrypts the concatenated values [Sym//Rights//contentID],
Rights describes the rights bought by the user, contentID identifies the content and
signCP is the signature of the CP on the license.

Given that PK encrypts the value [Sym//Rights//contentID], the SC is the only
entity capable of obtaining the key Sym from the license, by using the private
key SK (known only to the SC). Moreover, a compliant SC (as attested by the
compliance certificate discussed in the next section) will reveal the key Sym only to
the CoD during the content access discussed in Section 2.4. The license in (1), when
inspected, reveals neither the public key PK nor the rights nor the content identifier,
so it preserves the user’s privacy with respect to content and rights ownership.
Therefore, if found on a user’s storage device, it does not compromise the user’s privacy.
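The license format in (1) can be illustrated with a small sketch. It assumes textbook RSA without padding for both the encryption under PK and the signature of the CP, and it builds the keys from publicly known Mersenne primes, so it offers no security whatsoever; the paper does not prescribe these algorithms, and all function names and sample values are hypothetical.

```python
import hashlib
import secrets

def toy_keypair(p, q, e=65537):
    """Textbook RSA key pair (public, private) from two given primes."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))
    return (n, e), (n, d)

def rsa_apply(key, value):
    """Raise value to the key's exponent modulo n (encrypt, decrypt, sign, verify)."""
    n, exponent = key
    return pow(value, exponent, n)

def digest_int(number):
    """SHA-256 of a number, as an integer, used as the value that gets signed."""
    return int.from_bytes(hashlib.sha256(str(number).encode()).digest(), "big")

# Publicly known Mersenne primes: convenient for a demo, useless for security.
SC_PK, SC_SK = toy_keypair(2**521 - 1, 2**607 - 1)      # smart card key pair
CP_PK, CP_SK = toy_keypair(2**1279 - 1, 2**2203 - 1)    # content provider key pair

def issue_license(sc_pk, sym, rights, content_id):
    """CP side: build { PK[Sym//Rights//contentID] } and sign it."""
    payload = b"//".join([sym.hex().encode(), rights.encode(), content_id.encode()])
    message = int.from_bytes(payload, "big")
    if message >= sc_pk[0]:
        raise ValueError("payload too long for this toy modulus")
    sealed = rsa_apply(sc_pk, message)
    return sealed, rsa_apply(CP_SK, digest_int(sealed))

def signature_is_valid(sealed, signature):
    """Anyone holding the CP's public key can check the signature."""
    return rsa_apply(CP_PK, signature) == digest_int(sealed)

def sc_open(sealed):
    """SC side: only the holder of SK recovers Sym, Rights and contentID."""
    message = rsa_apply(SC_SK, sealed)
    payload = message.to_bytes((message.bit_length() + 7) // 8, "big")
    sym_hex, rights, content_id = payload.split(b"//")
    return bytes.fromhex(sym_hex.decode()), rights.decode(), content_id.decode()

sym = secrets.token_bytes(16)
sealed, signature = issue_license(SC_PK, sym, "play", "song-42")
print(signature_is_valid(sealed, signature))
print(sc_open(sealed) == (sym, "play", "song-42"))
```

The stored license consists only of the values `sealed` and `signature`, neither of which shows PK, the rights or the content identifier in the clear, which is the property the paragraph above relies on.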

Note that during the buying procedure the CP learns the association
(PK<->(contentID, Rights, Sym)), but not the user’s real identity, due to the
anonymous channel.

Security assumptions in this context are: 

• The user contacts the CP by means of an anonymous channel. 

• There is in place a mechanism which allows the user to pay anonymously for the
license he requests.

• A secret channel is used for the communication between the user’s SC and the 
CP. 

2.3 Acquisition of SC Compliance Certificate by the User 

In order for a user to securely access content on a CoD, a compliance certificate for
his SC must be shown to the CoD. This compliance certificate does not contain the
public key PK, however; it is issued by the CA-SC under a changeable pseudonym of the
SC. To obtain the compliance certificate for the SC, the user/SC contacts the CA-SC
anonymously, sends its public key PK (link 3 in Figure 1) and asks for the certificate.
Again, a secret channel is used between the SC and the CA-SC to prevent
eavesdropping. Assuming that the SCI keeps track of smart cards’ behaviour by means of
a revocation list with the PKs of hacked SCs, the CA-SC checks with the SCI whether
PK is on this black revocation list (see footnote 1). If it is not, the CA-SC then
generates a pseudonym for the SC, say a random number RAN, and issues the following
compliance certificate, which is sent to the SC (link 4 in Figure 1):



$$\{\, H(RAN),\ PK[RAN] \,\}_{signCA\text{-}SC} \qquad (2)$$

where H( ) is a one-way hash function, PK encrypts RAN and signCA-SC is the
signature of the CA-SC on the certificate.

1 This check of the revocation list for SCs may be performed by the CP as well when the
anonymous user sends his PK and asks for a license for a given content. If the SC has been
revoked by the SCI, the CP can refuse the issuance of the license for that SC/user.

The certificate in (2), when inspected, reveals neither the public key PK nor the
SC’s pseudonym RAN. Moreover, the only entity which can obtain RAN from the
certificate is the SC (via decryption with the private key SK). The value RAN may
then be checked by a verifier via the hash value in the certificate. The use of a
pseudonym RAN allows the verifier to check the compliance of the SC without learning
its public key PK. Moreover, linkability of different shows of a given SC’s compliance
certificate can be minimised. This is because frequent renewal of compliance
certificates is a requirement of the DRM system, since it implies that the compliance
of the SC is frequently checked and certified by the CA-SC. Frequent renewal can be
achieved by including a validity date in the compliance certificate; when this date has
passed, the SC is obliged to obtain a new compliance certificate to show to the CoD.
By renewing the value RAN every time a new certificate is obtained, linkability of
different certificate shows towards the CoD is minimised.
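The hash check and the renewal mechanism can be sketched as follows. The sketch assumes SHA-256 for H( ) and a 16-byte random RAN, leaves out the signature of the CA-SC, and replaces the encryption PK[RAN] by a placeholder that does not hide anything; the class and function names are hypothetical.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ComplianceCert:
    h_ran: bytes        # H(RAN)
    sealed_ran: bytes   # placeholder for PK[RAN]
    valid_until: date   # validity date that forces periodic renewal

def placeholder_seal(ran):
    """Stand-in for encryption under PK; a real system would encrypt RAN here."""
    return ran

def issue_cert(seal, valid_until):
    """CA-SC side: pick a fresh pseudonym RAN and bind it into a certificate."""
    ran = secrets.token_bytes(16)
    return ComplianceCert(hashlib.sha256(ran).digest(), seal(ran), valid_until)

def cod_accepts(cert, shown_ran, today):
    """CoD side: the certificate must be current and RAN must match H(RAN)."""
    fresh = today <= cert.valid_until
    matches = secrets.compare_digest(hashlib.sha256(shown_ran).digest(), cert.h_ran)
    return fresh and matches

cert_a = issue_cert(placeholder_seal, date(2004, 9, 30))
cert_b = issue_cert(placeholder_seal, date(2004, 10, 31))  # renewal with a new RAN

ran_a = cert_a.sealed_ran  # the SC would recover this value with SK
print(cod_accepts(cert_a, ran_a, date(2004, 9, 1)))    # certificate still valid
print(cod_accepts(cert_a, ran_a, date(2004, 10, 1)))   # validity date has passed
print(cert_a.h_ran != cert_b.h_ran)                    # renewed certificate differs
```

Because each renewal draws a new RAN, two certificates of the same SC share no visible value, so a CoD cannot link shows made under different certificates.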

In order to prevent linkability of pseudonyms, there are methods such as the con- 
vertible credentials of [10], which allow a user to obtain a credential from a given 
organization under a given pseudonym, and show that credential to another organiza- 
tion under another pseudonym. This type of approach involves protocols which are 
significantly more complex than the simple protocols described in this paper, which 
involve only simple hash operations. 

Note that during the procedure above the CA-SC learns the association 
(PK<->RAN), but not the real user’s identity due to the anonymous channel. 

Security assumptions in this context are: 

• The user contacts the CA-SC by means of an anonymous channel. 

• A secret channel is used for all communication between the user’s SC and the 
CA-SC. 

• The SCI is responsible for keeping track of the SCs’ behaviour.



2.4 Access to Content by the User 

Now the user can access the content for which he has the license, which can only be
done on a CoD (a device that behaves according to the DRM rules). To do so, he
must either carry the content and license with him (e.g. on an optical disk) or have
them stored in some location on the network. In either case, the encrypted content
and the license (link 6 in Figure 1) must first be transferred to the CoD. Moreover,
since the user is now physically present in front of the CoD, his real identity may be
“disclosed” to the CoD (e.g., the CoD may have a camera) or to any observer that may
also be physically present near the CoD. Therefore, in order to prevent the disclosure
of the association between the user’s real identity and PK to any party, the public key
PK of the user should not be revealed to the CoD at the time of content access. That is
the reason why the compliance certificate for the SC is issued with the changeable
pseudonym RAN. Upon checking that certificate, the CoD learns RAN but does not
learn PK. The full content access procedure is described below.

Before the SC and the CoD interact with one another, they do a mutual compliance 
check: 

• Compliance of the CoD is proved by means of a CoD compliance certificate.
This certificate is issued by the CA-CoD, which certifies the public key of the
CoD, and sent to the CoD (link 5 in Figure 1) beforehand. During the mutual
compliance check, the certificate is shown to the SC (link 8 in Figure 1). The SC must
therefore store the public key of the CA-CoD. This key may be changed periodically,
which obliges the CoD to periodically renew its compliance certificate.
This also implies that the SC must renew that key periodically, which can be done
at the time that the SC obtains its own compliance certificate from the CA-SC.

• Compliance of the SC is proved by means of the pseudonymous compliance cer- 
tificate in (2) which is shown to the CoD (link 7 in Figure 1). As mentioned 
above, the SC obtains the value RAN via decryption with the private key SK and 
sends it to the CoD which checks the value via the term H(RAN). Since the CoD 
can have a clock, the SC compliance certificate may have its time of issuance 
added to it, which obliges the SC to periodically renew the certificate when it 
gets too old. Note that it is also in the interest of the SC to renew its compliance 
certificate often enough so as to minimise the linkability mentioned above. 

After the mutual compliance check, the CoD sends the term
PK[Sym//Rights//contentID] from the license to the SC (link 9 in Figure 1), which
decrypts it and sends the values Sym, Rights and contentID back to the CoD (link 10 in
Figure 1). The CoD can then use Sym to decrypt the content and give the user access to
it, according to Rights.

Note that during the procedure above the CoD learns the association
(RAN<->(contentID, Rights, Sym)), and may learn the user’s real identity. Therefore,
an attacker in control of the CoD may be able to obtain the user’s real identity (e.g., a
picture of the user), his SC’s pseudonym RAN as well as the ID of the content which
was accessed by the user during that transaction and the accompanying rights (see
footnote 2). This fact, however, compromises the user’s privacy only concerning the
specific content and rights involved in that transaction. This type of attack cannot
really be avoided. However, the attacker cannot learn PK, but only the value RAN. As
this value changes often, the user can be tracked only for a limited number of
transactions.

Security assumptions in this context are: 

• The CA-CoD is responsible for keeping track of the CoDs’ behaviour as well as for
issuing compliance certificates for those devices.

• A compliant SC will only reveal the decryption key Sym to a compliant device 
(CoD). 

• The CoD will not reveal the key Sym to any party, except for perhaps another 
(proven) compliant device. 



2 Note, however, that Sym is not revealed to the attacker since the CoD is assumed to be
DRM-compliant.

3 Anonymous Transfer of Licenses

In order for a user (from now on referred to as the first user), whose public key is PK,
to transfer his license to a second user, whose public key is PK’, in a way that is both
secure (i.e., that prevents the first user from still being able to access the content)
and anonymous, solutions must be found which deal with license revocation and
anonymity. These are discussed below.

3.1 License Revocation 

When the first user wants to transfer his license, he contacts the CP via an anonymous 
channel, authenticates himself as user PK, presents the license to be transferred to the 
second user and provides the public key PK’ of the second user. Note that here the 
transfer is not anonymous. The CP marks that license of PK as “revoked”, but before 
the CP creates a new license with PK’, revocation of the old license must be dealt 
with. 

The revocation problem above can be solved by including in the compliance certifi- 
cate of the first user’s SC a list with all the licenses of that user that have been marked 
as “revoked” by the CP (i.e., a black revocation list). This can be done during the 
protocol between the SC and the CA-SC, in which the SC obtains his compliance 
certificate as given in (2). During this protocol, the CA-SC contacts the CP, sends PK 
and asks for the list of all revoked licenses corresponding to that PK. Since the sym- 
metric key Sym that encrypts the content is unique per license, the CP can use this 
value to identify each revoked license associated with PK. The CP then creates a list 
with the values: 

$$\begin{aligned}
&H(Sym_1 // Time),\\
&H(Sym_2 // Time),\\
&\qquad\vdots\\
&H(Sym_n // Time),
\end{aligned} \qquad (3)$$

where each value is the hash of the key Sym_i of a revoked license concatenated with
the current time. The one-way hash function H( ) is used to reduce the size of each
term in the revocation list in (3), but also to hide the values of Sym_i from any party
which does not need to learn those values. The current time is concatenated with each
Sym_i in order to prevent compliance certificates issued for PK on different occasions
from being linked via the revocation list.

Once the values for all revoked licenses of PK are included in the list, this list is 
sent by the CP to the CA-SC together with the value Time. At this point, the CP can 
consider as “dealt with” the revocation of the licenses of PK which had been previ- 
ously marked as “revoked”. In this case the CP can create, for instance, the new li- 
cense for the second user with his public key PK’. 

The CA-SC, on the other hand, can now include the revocation list, as well as the
time at which this list was created, in the SC’s compliance certificate as

$$\{\, H(RAN),\ PK[RAN],\ Time,\ H(Sym_1//Time),\ H(Sym_2//Time),\ \ldots \,\}_{signCA\text{-}SC} \qquad (4)$$

where the terms H(Sym_1//Time), H(Sym_2//Time), ... refer to all revoked licenses of
PK at time ‘Time’.

The certificate above is then sent to the SC, which may keep it stored in the SC
itself. At the present time, a typical SC (with public key PK) may store a compliance
certificate whose revocation list has up to around five hundred revoked licenses of that
PK. When/if the revocation list becomes so big that storage in the SC is no longer
possible, the certificate can be stored, for instance, on a server in the network or on an
optical storage medium, much like the storage of the content and the license
mentioned previously.

As discussed in Section 2.4, when a user requests access to content on a CoD,
the content plus license must first be transferred to the CoD. Since the SC must
always prove its compliance to the CoD upon a user’s request for content, it must
present the compliance certificate as given in (4). So, after the mutual compliance
check, the CoD sends the term PK[Sym//Rights//contentID] from the license to the SC,
which decrypts it and sends the values Sym, Rights and contentID back to the CoD. But
before the CoD uses Sym to decrypt the content and give the user access to it
(according to Rights), it calculates H(Sym//Time) and checks whether this value is in
the revocation list. If it is not, the CoD proceeds with the handling of the access
request.
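The construction of the revocation list by the CP and the lookup performed by the CoD can be sketched as follows. The sketch assumes SHA-256 for H( ), byte strings for the keys Sym_i and a textual timestamp for Time; these encodings and the function names are hypothetical.

```python
import hashlib

def entry(sym: bytes, time_stamp: str) -> bytes:
    """One revocation-list term H(Sym_i // Time)."""
    return hashlib.sha256(sym + b"//" + time_stamp.encode()).digest()

def build_revocation_list(revoked_syms, time_stamp):
    """CP side: the terms of the list for all revoked licenses of one PK."""
    return {entry(sym, time_stamp) for sym in revoked_syms}

def cod_allows(sym, time_stamp, revocation_list):
    """CoD side: recompute H(Sym // Time) and look it up before decrypting."""
    return entry(sym, time_stamp) not in revocation_list

revoked = [b"key-of-revoked-license-1", b"key-of-revoked-license-2"]
time_1, time_2 = "2004-08-30T10:00", "2004-09-03T16:30"
list_1 = build_revocation_list(revoked, time_1)   # in the first certificate
list_2 = build_revocation_list(revoked, time_2)   # in a renewed certificate

print(cod_allows(revoked[0], time_1, list_1))               # False: revoked
print(cod_allows(b"key-of-valid-license", time_1, list_1))  # True: not revoked
print(list_1.isdisjoint(list_2))                            # True: lists share no term
```

The last line illustrates why Time is hashed together with each key: the same set of revoked licenses yields entirely different terms in certificates issued at different times.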

3.2 Anonymous Licenses 

When the license is transferred from the first to the second user, the CP learns the 
association between those users, i.e., the association between the public keys PK and 
PK’. The knowledge of this association may be unwanted by the users. A solution to 
this problem is the use of generic licenses, from now on referred to as “anonymous 
licenses”, in which a user identity is not specified. 

An anonymous license is a license for a specified content with specified rights (as 
the license given in (1)), but which is not associated with a user (i.e., with a public 
key). Such a license can be issued by the CP for any anonymous user who pays for a 
given content with given rights. It can also be issued for the first user who requested 
the revocation of his license in order for it to be transferred to the second user (as 
described in Section 3.1). Since the license is not associated with a given person, it can 
be transferred (given, sold, etc) to any other person. This person can later present the 
license to the same CP to be exchanged for a personalised license as given in (1), 
which can then be used for content access. The procedure is shown schematically 
below. 



[Figure: the first user sends a request to the CP and receives an anonymous license
(Anon. Lic.); the anonymous license passes from the first user to the second user; the
second user presents it to the CP and receives a personalised license (Pers. Lic.).]

Fig. 2. Schematic representation of the anonymous license transactions between the users and
the CP.

For security reasons, however, before the CP issues the anonymous license, a
unique identifier must be assigned to it. This is done in order to prevent any copy of
the anonymous license (which can easily be made by the user) from being redeemed
once the license has already been redeemed. If this identifier is chosen by the CP,
however, the CP will be able to link the public keys of both users (the one who
transfers and the one who later redeems the anonymous license). In order to prevent
that, blind signatures [11] can be used as described below.

The first user creates a secret random identifier ID, blinds this value (by, e.g., mul- 
tiplying the value ID by another randomly chosen value) and sends it to the CP. To- 
gether with this blinded value, the user may also send a specification of the new rights 
NewRights which are to be associated with the anonymous license (in case the license 
is being transferred between users), provided that the specified rights allow less than 
the original rights. This possibility allows a user to give one of his licenses to another 
user but with more restrictive rights than the original rights he had, if he so wishes. 

The CP, on the other hand, must have a unique pair of public/private keys for each
combination of right and content {Rights, contentID}. It is assumed here that the set of
all rights is pre-specified, comprising, say, R rights, and that the set of all content has
C items. This means that the CP must have RxC different public/private key pairs. Given
this setting, once the CP receives the data {Blind[ID], NewRights} from the first
user, it can sign the blind identifier, Blind[ID], with the private key for the
combination {NewRights, contentID} and return to the user the value
$\{Blind[ID]\}_{signed\text{-}NewRights\text{-}contentID}$. The user then un-blinds the
signed identifier to obtain $\{ID\}_{signed\text{-}NewRights\text{-}contentID}$ and can
give this value, together with the license specification {NewRights, contentID}, to the
second user.

In order to later obtain a personalised license, the second user contacts the CP
anonymously, authenticates himself with his public key PK’ and sends to the CP the
signed identifier $\{ID\}_{signed\text{-}NewRights\text{-}contentID}$ together with
{NewRights, contentID}. The CP can then find the correct key pair, check its own
signature on the value ID, and, if it is correct, finally issue a personalised license to
the second user (which is sent to him together with the content encrypted with a
personalised key Sym’):

$$\{\, PK'[\,Sym' // NewRights // contentID\,] \,\}_{signCP} \qquad (5)$$



After the issuance of the license above, the value ID is entered by the CP into a list 
of used IDs. This prevents the personalised license request for an already redeemed 
anonymous license. 
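The blinding, signing, un-blinding and redemption steps can be illustrated with a short sketch. It assumes an RSA-based blind signature in which blinding multiplies ID by a random factor raised to the public exponent, uses textbook RSA on the raw identifier, and takes its modulus from publicly known Mersenne primes, so it is not secure; a real system would at least sign a padded hash of ID. The names `CP_KEYS`, `used_ids` and the sample rights are hypothetical.

```python
import math
import secrets

def toy_keypair(p, q, e=65537):
    """Textbook RSA key pair (public, private) from two given primes."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))
    return (n, e), (n, d)

# One key pair per {Rights, contentID} combination (only one is shown here).
CP_KEYS = {("play-once", "song-42"): toy_keypair(2**521 - 1, 2**607 - 1)}
used_ids = set()   # identifiers of anonymous licenses already redeemed

def blind(identifier, public_key):
    """User side: hide ID by multiplying it with r^e for a random r."""
    n, e = public_key
    while True:
        r = secrets.randbelow(n - 2) + 2
        if math.gcd(r, n) == 1:
            return identifier * pow(r, e, n) % n, r

def cp_sign_blinded(blinded, rights, content_id):
    """CP side: sign with the private key of the requested combination."""
    n, d = CP_KEYS[(rights, content_id)][1]
    return pow(blinded, d, n)

def unblind(blind_signature, r, public_key):
    """User side: remove the factor r to obtain the signature on ID itself."""
    n, _ = public_key
    return blind_signature * pow(r, -1, n) % n

def cp_redeem(identifier, signature, rights, content_id):
    """CP side: accept a correctly signed identifier only once."""
    n, e = CP_KEYS[(rights, content_id)][0]
    if pow(signature, e, n) != identifier or identifier in used_ids:
        return False
    used_ids.add(identifier)
    return True

public_key = CP_KEYS[("play-once", "song-42")][0]
identifier = secrets.randbelow(public_key[0] - 2) + 2      # secret random ID
blinded, r = blind(identifier, public_key)                 # first user
blind_signature = cp_sign_blinded(blinded, "play-once", "song-42")
signature = unblind(blind_signature, r, public_key)        # anonymous license

first = cp_redeem(identifier, signature, "play-once", "song-42")   # second user
second = cp_redeem(identifier, signature, "play-once", "song-42")  # copy of it
print(first, second)
```

When signing, the CP sees only the blinded value, so it cannot connect the later redemption of ID to the user who requested the signature, while the list of used IDs still rejects a second redemption.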

As mentioned previously, one application of anonymous licenses is the unlinkable 
transfer of licenses between users. In this case, when the revocation of the old license 
of the first user is dealt with, the CP simply issues an anonymous license for that user, 
rather than issuing a new license with the public key of the license receiver. Another 
application relates to the business model of giving an incentive for users to buy a cer- 
tain content, for instance, the “buy one, get a second one for free” model. The second 
license can be issued as an anonymous license which can be transferred to any person. 




Privacy-Preserving Digital Rights Management 93 



3.3 Identity-Based Cryptography for Key Management by the CP 

In the solution described above, the CP has to maintain a huge list with RxC different 
public/private key pairs and the corresponding “Rights” and “contentID” values. This 
solution can be simplified with techniques from identity-based cryptography, in par- 
ticular the identity-based blind signature method described in [12]. This method can be 
applied in the present context, but instead of using the identity of people or different 
parties to generate the keys (as proposed in [12]), the concatenation of the content 
identifier, the rights and the CP’s name can be used for key generation. In this way, a 
public key can simply be defined as the string [ContentID//Rights//CPname] and the 
corresponding private key is generated based on that string and on a master key gener- 
ated by the CP. 
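
The storage saving can be illustrated with a short sketch. Note the assumption loudly: this is NOT identity-based cryptography and not the method of [12]; an HMAC is swapped in as a stand-in, only to show that a per-combination secret can be regenerated from a master key and the public string instead of being stored. The function names are hypothetical.

```python
import hashlib
import hmac


def public_key_string(content_id: str, rights: str, cp_name: str) -> str:
    """The public string [ContentID//Rights//CPname] described in the text."""
    return f"{content_id}//{rights}//{cp_name}"


def derive_private_key(master_key: bytes, public_string: str) -> bytes:
    """Stand-in derivation (assumption): recompute the secret on demand.

    A real identity-based scheme derives an asymmetric private key whose
    signatures anyone can verify from the public string; HMAC cannot do that.
    """
    return hmac.new(master_key, public_string.encode("utf-8"), hashlib.sha256).digest()


master = b"master key known only to the CP"
key_a = derive_private_key(master, public_key_string("song42", "play-10", "ExampleCP"))
key_b = derive_private_key(master, public_key_string("song42", "copy-1", "ExampleCP"))
```

The CP keeps only `master`; each of the RxC combinations yields its own key when needed.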



4 Privacy-Preserving Domain Creation 

According to the basic idea of Authorized Domains [6] [7], when a user of the P2DRM 
system buys a piece of content, other users in his home may be allowed to access that 
content as well. However, while supporting the authorized domain idea, the system 
should preserve the user’s privacy. Ideally, the system should allow different users within a
domain to have different rights for the same piece of content. Furthermore,
the structure of the domain should remain private. This means that no parties in
the system, except maybe the one that is responsible for the creation of domains, 
should be able to link the domain members (or their identifiers) together, as well as 
with a domain identifier. Therefore, the first problem addressed in this section is to 
preserve privacy of the domain structure and allow differentiations of the rights inside 
a domain. 

The second problem addressed is the management of countable rights in the
P2DRM system. Countable rights are rights that the user can spend (such as play n
times, or copy once). These rights are dynamic, because they change over time. This
causes a problem in the P2DRM system, as the licenses are in the form of certificates
signed by a content provider (CP). Therefore, the usage of countable rights requires
that the license, which the user gets from the CP, be changed and signed again every
time the user spends the rights. Furthermore, the system has to support revocation of
the old licenses, because the user can easily copy the license before spending the
rights, spend the rights, then delete the changed license, and finally use the copy of the
old license to spend the rights again. A privacy problem remains as well, namely
how to prevent the CP from learning the time, content, device, and user’s PK for each
user action that involves countable rights.

In this section, we address these two problems: (1) privacy preserving domain for- 
mation, and (2) privacy preserving management of countable rights. 

The solution to the aforementioned problems is presented step by step. In the fol- 
lowing two sections we first provide two simple straightforward approaches to the 
solution of the problems. By describing their drawbacks, we emphasize the privacy 
problems. Then in the third section, we explain a privacy-enhanced solution that over- 
comes the privacy problems found with the previous solutions. 
4.1 Domain with PK_D

In this section, we describe a solution for the domain construction in the P2DRM system
based on a shared domain key PK_D. The domain has to be registered with a domain
authority (e.g. the municipality), which can check that the members indeed form
a group, e.g. a family. The same domain authority can assign a PK_D to that group of
users and add SK_D to their smartcards 3 . Having done that, a user can buy content for
his personal use (using his personal PK) or for the whole domain using the domain key
PK_D. In the case of buying content for the whole domain, the user distributes the content
and the license to the other users of the domain. They will use it in the same way
in which they use their personal content (as they can choose between two keys in their
smartcards, one for personal content and another one for the shared content) 4 . However,
when buying content with PK_D, there is no possibility to assign different rights
to different members of the domain, as they use the same license.

With respect to countable rights, a device (which the user operates) can contact the
CP online when the spending of the rights occurs. The CP can issue a new license and
revoke the old license when the spending of countable rights occurs. Revocation can
be done as described in Section 3. However, inserting old licenses into a black revocation
list only when the user goes to the Certificate Authority which certifies compliance of
smart cards (CA-SC) might not be an effective measure, because this does not happen
immediately after spending countable rights, but periodically. Therefore the user will
be able to use old licenses until he is forced to obtain a new smart card compliance
certificate. Even if this problem is solved, the following privacy problem remains. As
the CP must change and revoke a license at the time the user spends rights, the CP will
learn when, by which user (PK), and on which device each piece of content was used.

With the solution described above, we have achieved privacy towards the CP for
the domain structure, because the CP does not learn the structure of the domain. However,
there is no rights differentiation within the domain, which means that either all
members have the same rights as the user who has bought the content, or all others
have no rights. Furthermore, the CP knows the time, content, device, and user’s PK
for each user action that involves changing of countable rights. Finally, the solution
for countable rights revocation is not appropriate, because the revocation list is
updated only when the user goes to the CA-SC for a new RAN, which is too late (until
then he can copy content n times instead of only once).



3 To do that, the domain authority has to contact the smart card issuer for the secret keys,
which will allow insertion of the domain keys into the secure storage of the smartcards. Note
that in this process the smartcard issuer will not learn the structure of the domain (e.g. the
domain authority will not send a separate request to the smartcard issuer for each domain, but
can do so for several domains in bulk). Note also that there is no need for the domain authority
to learn the users’ personal public keys. Therefore the unlinkability between personal PKs and
user identities will remain.

4 To facilitate the management of licenses, a license can be accompanied by an identifier,
which indicates whether the license is personal or shared in the domain. The smartcard can use this
information to choose the right key (SK_D or the personal SK) when decrypting a license.







4.2 Domain with Different Rights 

The user who is buying the content may want to assign different rights to different
members of his domain. For that, the user may create a data structure,

PK_1  Rights_1 ,
PK_2  Rights_2 ,
...                        (6)
PK_n  Rights_n ,

where PK_1, PK_2, ..., PK_n are the public keys of the domain members (possibly
including the user who buys), while Rights_1, Rights_2, ..., Rights_n are different rights
expressions.

The list above is sent to the CP in the process of content buying. The process is
similar to the one described in Section 2. The difference here is that the CP checks
with the domain authority whether the group of keys from the list really forms a domain. The
CP can do that interactively with a domain authority. On the other hand, the user can
also obtain a certificate from the domain authority, certifying that {PK_1, PK_2, ..., PK_n}
are in the same domain. In both cases, if the CP is assured, he can (using the list (6))
directly create for each domain member i a personal certificate:

{ PK_i[ Sym // Rights_i // contentID ] } signCP .   (7)

Note that the key Sym is the same for all users in the domain for a given content,
therefore only one copy of Sym[content] needs to be kept for the whole domain. Each
user in the domain gets only a personalised rights package. Note also that the provider
will create the certificates as in (7) only if the condition (Rights_i < Rights) holds,
where Rights are the rights that have been bought.
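
The check performed by the provider before creating the certificates in (7) can be sketched as below. Several things here are assumptions made for illustration: rights are encoded as a mapping from a right name to a number of uses, the condition is read as "a member's rights must not exceed the bought rights", and the returned payloads are plain strings that stand in for the encrypted and signed certificates.

```python
from typing import Dict, List, Tuple

Rights = Dict[str, int]  # hypothetical encoding: right name -> number of uses


def within_bought_rights(member_rights: Rights, bought: Rights) -> bool:
    """True if every right given to a member is covered by the bought rights."""
    return all(count <= bought.get(name, 0) for name, count in member_rights.items())


def member_payloads(
    domain_list: List[Tuple[str, Rights]],  # structure (6): (PK_i, Rights_i) pairs
    bought: Rights,
    sym: str,
    content_id: str,
) -> List[Tuple[str, str]]:
    """Build one unsigned payload Sym//Rights_i//contentID per domain member."""
    payloads = []
    for pk, rights in domain_list:
        if not within_bought_rights(rights, bought):
            raise ValueError(f"rights for {pk} exceed the bought rights")
        payloads.append((pk, f"{sym}//{sorted(rights.items())}//{content_id}"))
    return payloads
```

Note that the same `sym` appears in every payload, which mirrors the remark that only one encrypted copy of the content is needed per domain.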

The assignment of rights to domain users may happen at a later stage as well. In
this case, the data structure as in (6) is sent to the provider along with PK and contentID
of the content to be shared in the domain. The CP can check that indeed PK
bought contentID, and can check with the domain authority the keys of the domain
members. The CP may then create and send back to the purchaser the certificates as
given in (7).

With this solution, we have achieved rights differentiation within the domain
without using an extra key (the domain key PK_D). However, consumers lose their
privacy towards the CP regarding the domain structure, because the CP learns the
structure of the domain (not the real user identities, but the PKs which form the domain).
This can be used for advertisement, spam, etc. Problems regarding behavioral
privacy and revocation of countable rights remain as described in the previous subsection.

4.3 Domain with PK_D and Different Rights

Let us assume that the users who form a domain have been registered by a domain
authority and obtained PK_D/SK_D as described in Section 4.1. When buying a piece of
content, the user with public key PK obtains the following master certificate:

{ { PK_D[ Sym // Rights // contentID ], 1 } signCP , MR } signCP .   (8)

The master license consists of the domain license shown in (9) and the master
rights tag (MR), signed all together by the CP. The domain license consists of the symmetric
key, master rights, and content ID encrypted by the domain key PK_D as well as
the delegation tag (set to 1), signed all together by the CP.

{ PK_D[ Sym // Rights // contentID ], 1 } signCP .   (9)



At the end of the process of obtaining this certificate from the CP, the user can encrypt
the master certificate (as in (10)) in order to preserve his privacy towards the
domain members who share the PK_D. So, no user in the domain will be able to see the
license and rights of the user who has bought the content.

PK[ { { PK_D[ Sym // Rights // contentID ], 1 } signCP , MR } signCP ] .   (10)



To create license(s) for domain member(s), the master license has to be supplemented
with licenses for particular domain members. The creation of personalized
user rights (for particular domain members) is done by a Domain Manager device
(DM). The user who has bought the content prepares the rights for other domain users
(structure (6) in Section 4.2) and sends it together with the master license to the DM.
In the interaction with the DM, the user decrypts the encrypted certificate (10) and
consequently the term PK_D[Sym//Rights//contentID]. The user also has to show to the DM
certificates that attest that all PK_i that are mentioned in the structure (6) (for
which he wants to prepare licenses) actually belong to his domain 5 . Then, the DM
creates an extra license (second license in (11)).

{ PK_D[ Sym // Rights // contentID ], 1 } signCP ,
                                                            (11)
{ PK_i[ Sym // Rights_i // contentID ], PK_DM } signDM .



Finally, the user distributes these rights to the domain members. When accessing
the content, a domain member must present to the device both licenses in (11) and the
compliance certificate for the DM 6 . The reason to present both licenses is to allow the
device to check if the user belongs to the domain (if he knows both PK_i and PK_D) but
also to check that the rights Rights_i < Rights (as an extra assurance for the CP because
at the end the licence issued by the CP is checked).

The procedure described above makes sure that only the user who has bought the 
content and has the master certificate (8) can create licenses for the domain members 7 . 

The introduction of the DM as a party who takes care of the user rights within the 
domain is also beneficial for the management of the countable rights. Now, the DM 



5 The certificate links the domain key PK_D with the members’ keys {PK_1, PK_2, ..., PK_n}. Alternatively,
the DM can store this certificate.

6 The compliance certificate for the DM might be issued by an authority, which certifies com- 
pliance of devices. 

7 If we do not use the double certificate as a master license, any domain member could use the
master right to create the maximum rights for himself (as he will obtain exactly the master
right as the delegation right in (11)).
can issue new licenses and revoke old licenses when the spending of a countable right
occurs. In that way the user privacy towards the CP is protected, because the CP is not 
contacted every time the user spends rights. Therefore the CP cannot create logs that 
link user’s PK, content identifiers, device identifiers and time when spending of 
countable rights occurs. Moreover, this solution is also beneficial for the CP, because 
the revocation of the old license is managed 8 by the DM and therefore it is instant 
(which resolves the problem of late revocation with the CA-SC). 
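
How the DM can handle the spending of countable rights with instant revocation may be sketched as follows. This is only an illustration under stated assumptions: licenses are identified by a hypothetical serial number, a single "plays left" counter stands in for the rights expression, and the signing and encryption shown in (11) are omitted.

```python
import itertools
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class License:
    serial: int       # hypothetical identifier used for revocation
    content_id: str
    plays_left: int   # a single countable right, for simplicity


class DomainManager:
    """Issues replacement licenses and revokes spent ones without contacting the CP."""

    def __init__(self) -> None:
        self._serials = itertools.count(1)
        self.revoked: Set[int] = set()  # revocation list kept by the DM

    def issue(self, content_id: str, plays: int) -> License:
        return License(next(self._serials), content_id, plays)

    def is_valid(self, lic: License) -> bool:
        return lic.serial not in self.revoked

    def spend(self, lic: License) -> License:
        """Spend one play: revoke the old license at once, return its successor."""
        if not self.is_valid(lic):
            raise PermissionError("license has been revoked")
        if lic.plays_left <= 0:
            raise PermissionError("no countable rights left")
        self.revoked.add(lic.serial)
        return License(next(self._serials), lic.content_id, lic.plays_left - 1)
```

A saved copy of an old license is rejected by `spend` as soon as its successor exists, which is the replay scenario that periodic revocation at the CA-SC leaves open.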

The presented approach achieves both rights differentiation within the domain and
privacy towards the CP for the domain structure (because the CP does not learn the structure
of the domain). Furthermore, the problem of behavioral privacy towards the CP is
solved, because the CP cannot learn the time, content, device, and user’s PK for each
user action that involves changing of countable rights. Finally, the solution for countable
rights revocation is appropriate as licenses are revoked instantly. The
solution does add some complexity, since it introduces the DM and one
more license and certificate. In exchange, the CP has much less work (domain rights are issued
by the DM, revocation is also done by the DM).

5 Discussion 

In this paper, a DRM system is described which protects users' privacy while preserv- 
ing the system’s security. The privacy and security aspects of the system are discussed 
below. 

In the basic system, user privacy is achieved by decoupling the real user identity 
from his identifiers, namely PK and RAN, in the DRM system. Concerning the rele- 
vant entities in the system: 

• the SCI does not know any association of user’s identities and content/rights,

• the CP knows the association (PK ↔ (content, Rights, Sym)),

• the CA-SC knows the association (PK ↔ RAN),

• the CoD knows the association (RAN ↔ (content, Rights, Sym)).

Therefore, even by a collusion of the CP, the CA-SC and the CoD, the real identity
of the user cannot be revealed since only the user knows the association (real user
identity ↔ PK).

Furthermore, if an attacker is able to obtain user-related information from the CoD
after a content access transaction happens, the associations

• (real user identity ↔ RAN),

• (real user identity ↔ (content, Rights, Sym))

become known to him. But since RAN changes periodically and only one piece of
content is associated with the user’s real identity, the privacy damage is minimal. As
the attacker cannot learn the user’s public key PK from the CoD, he cannot create a
full log of the user’s ownership of content and pattern of content usage.



8 The Domain Manager can store a black revocation list of revoked licenses and request that each
time before content is accessed on a device, the device checks that list. If the license used to
access the content is in the list, the device refuses the license and blocks access.
As for security requirements of the basic DRM system, the solution proposes a
compulsory mutual compliance check upon a content access transaction. That is, the 
SC must always check if the CoD is compliant by means of a compliance certificate 
issued by the CA-CoD, and the CoD, in its turn, must also always check the SC for 
compliance, also by means of a compliance certificate. These certificates are such that 
they must be renewed often. The privacy of the user is preserved with the use of tem- 
porary pseudonyms for the SC. 

Concerning the transfer of licenses between users, the solution proposed also guar- 
antees the security of the DRM system and the privacy of the user. 

Security is dealt with via revocation of transferred licenses. This is achieved by
means of the compliance certificate in (4), which includes the revocation list with all
revoked licenses of a given SC. A requirement is that the compliance certificate in (4)
be frequently renewed by the SC. This is done in the interest of both the user and the
DRM system, for the following reasons:

• in the interest of the user, it is done in order to minimise linkability via the pseudonym
RAN of the user’s content access requests to different content 9 , and

• in the interest of the DRM system, it is done as a requirement of the CoD, which
checks via the value Time whether the certificate (and therefore the license
revocation list) is too old.

In case the user does not care much about the linkability problem (which would 
cause infrequent renewal actions on the part of the user), the renewal can be forced as 
a requirement of the CoD. As a consequence of this frequent renewal of compliance 
certificates, renewed values of revoked licenses of PK are also frequently available to 
the CoD. 

User privacy in the license transfer process is achieved by means of anonymous li- 
censes. These are licenses which can be redeemed at the CP for real usable licenses. 
They are anonymous since they do not include any identifier of the user who bought or 
exchanged his old license for the anonymous license. For security reasons, however,
they must include a unique identifier that can be checked by the CP to prevent an
anonymous license from being copied and redeemed multiple times. While guaranteeing security
for the DRM system, this unique license identifier may be used by the CP to link
the first user (who revoked his license) and the second user (who later redeems the 
anonymous license). Users’ privacy in this case is preserved with the use of blind 
signatures. 

The use of identity-based cryptography to generate the signing key pairs for the CP 
also enhances the system’s security. In addition to greatly facilitating key management 
by the CP 10 , the solution allows anyone to check the CP’s signature on the anonymous 
license if they know the content identifier, the rights and the provider’s name (since 



9 Note that, even if RAN changes, a given user’s content access requests to the same content 
and in the same CoD allows that CoD to link the two actions via the license, but this only if 
the CoD keeps a record of each and every license shown to it. 

10 In this case, the CP does not need to store the list of all RxC key pairs anymore (a private key 
can be generated each time it is needed). And even in case storage is preferred over computa- 
tion, only the private keys need to be stored. 
these values make up the public key). The check of the CP’s signature is essential in
case the second user buys the license from the first user. The second user needs to be 
sure that the anonymous license he receives from the first user indeed refers to a given 
content with given rights, and that the license can be redeemed with a given CP. 

Finally, regarding the distribution of licenses to domain members, the solution pro- 
posed also guarantees the privacy of users while preserving the security of the DRM 
system. This is achieved by means of an approach for private creation and functioning 
of an authorized domain. 

The approach provides privacy concerning the domain structure by preventing the 
CP from learning which domain members compose a domain. A Domain Manager 
device is introduced to solve privacy problems within the domain. It is used for issuing 
rights to domain members, which in turn allows differentiations of rights among them 
without the involvement of the CP. This device is a compliant device, which is trusted 
by the CP, thus guaranteeing the DRM system’s security. Furthermore, the Domain 
Manager device decreases the workload of the CP, taking over the management of 
countable rights. While solving the problem of late revocation of those rights, this 
further provides behavioral privacy for users in the domain. 
Abstract. Coordinated Web services can help alleviate users’ privacy and economic,
social, and ethical concerns that arise from third parties’ access and use
of user private data. This paper focuses on the requirements and design of such 
services in support of a client-side private data management system. Appropri- 
ate management of private data on the client side can both educate and assure 
users that their privacy is well guarded, and that their private data is being used 
by entities which satisfy economic and/or ethical user concerns. Our solutions 
describe novel Web services, interaction with P3P agents, and a client-side pri- 
vacy architecture. A preliminary prototype implementation of our Web services 
using standard UDDI, SOAP, and WSDL technologies and rudimentary delay 
estimates are briefly discussed. 

Keywords: P3P, Privacy Web Services, Private Data Management 



1 Introduction 

Industry watchdog Gartner Group has predicted that, by 2006, the number one barrier 
to electronic business and commerce will be user concerns over information privacy 
(Gartner, 2003). Empirical evidence shows users’ trust in electronic business can be 
heightened by pragmatic means such as the use of privacy enhancing tools at the cli- 
ent-side, and simple support mechanisms at the business side (Jutla et al. 2004). Thus 
users’ influence on business’ fair information practices will increase as we become 
empowered with more powerful online privacy tools. Businesses, in future, could use 
their handling of online privacy as a competitive advantage as opposed to a cost or 
barrier to business opportunities. An example is the business that appears as a hit on 
search results pages as search engines become enabled with adding user privacy pref- 
erences to search criteria. AT&T scientists are currently building such a prototype 
search engine incorporating agents based on the Platform for Privacy Preferences 
(P3P) specification (Byers et al. 2004). 
Although privacy is a major concern, it is not the only concern when considering
handling of private data. Users would also like to be assured that their data is being 
shared with entities or businesses that they consider to use ethical labor and work 
practices, or who are environmentally friendly. They may also be concerned about 
market dislocation in a global economy. Many citizens, especially small business 
owners support local or national economies, and may express this by a preference to 
buy Canadian, for example. 

Important technical projects, such as P3P, address privacy. At heart, P3P provides 
an XML vocabulary and a data model for supporting user’s online privacy. The P3P 
protocol and agents are designed for the automatic machine reading of Web sites' 
privacy policies and their comparison with user privacy preferences specified at the 
client-side. P3P-enabling of a Web site refers to the business’ storage of its privacy 
policy in XML-based P3P format on its Web site. Currently, 30 percent of the top 100
Web sites and 23 percent of the top 500 Web sites are P3P-enabled (Cranor 2003). Top 100
refers to the 100 most visited Web sites. Adoption
figures are not yet available for AT&T Bird. Judging from one author’s observation of
150 university students’ enthusiastic reaction to Bird, adoption figures will be favor- 
able for the 18-24 year old demographic. While P3P agents and the P3P platform are 
significant steps for Web users’ privacy protection, in this paper we will motivate 
complementary Web services, extensions to P3P agents, and other private client-side 
data management components to increase the user’s control or management of his/her 
private data. As noted before, we want to allow the user to have control and manage 
his/her private data for purposes beyond privacy - purposes that include users’ con- 
cerns over relationships with sites that have social and ethical values, and economic 
interests that are in conflict with those of the user. 

The paper organization is as follows. Section 2 provides an example scenario to 
motivate and clearly position where our private data management solutions, including 
Web services, fit in the world of user concerns over private data. Section 3 presents 
the client-side private data management architecture and its agent objects that invoke 
Web services for further management support. Section 4 presents the design of Web 
services in UML. In section 5, implementation of the Web services, including access
to a prototype privacy ontology that we created for the Canadian Personal Information
Protection and Electronic Documents Act (PIPEDA), and a generic user regulatory
agent illustrate the implementation and execution feasibility of our Web-services
design for electronic private data management. Related work is summarized in section
6. The final section offers summary and conclusions.

2 Example User Scenario and Requirements 

Consider a busy professional living in the US who is seeking an online pharmacy to 
fill a prescription. She invokes the Google search engine and searches on the “online
pharmacy” keyword string. She sifts through the results and finds several Canadian pharmacies
that have considerably lower prices for the medication. (We acknowledge that
some large drug companies are introducing governance policies to suppress this practice;
however, the example is still applicable and analogies can be drawn across
industries.) She normally buys from Walblue’s, with which she enjoys a trusted relationship.
Switching to a Canadian supplier (CanPharma.ca) for this medication has
uncertainty overhead associated with it.

Areas of uncertainty involve mainly privacy concerns, but also security, ethical, 
economic, and social concerns. The economic concern here is for herself as opposed 
to a local economy. Various issues that she may be concerned about are: 

(1) She wants to know all the intended purposes/uses and possible dissemination of 
her medical information. 

(2) She wants to do business only with firms that post a privacy policy. 

(3) She wants to know that the business’ privacy practices match her privacy prefer- 
ences and be alerted if they do not. 

(4) She wants a store that will not only encrypt her credit card information but also 
the prescription contents of her shopping basket. 

(5) She wants to know whether her privacy preferences match the privacy practices 
of each of the pharmacy’s third party business partners. 

(6) She wants to know whether privacy laws and authorities exist in Canada to en- 
force the intentions stated within the privacy policies on the pharmacies’ web 
sites. 

(7) She also does not want to do business with a company that has CheatersInc or
UnGreenCompany as a third party business partner because she considers these 
to be unethical or environmentally-unfriendly. 

(8) She does not want to have her information shared with a third party business 
partner that is in a country with poor privacy laws. 

(9) She does not want to deal with a company that shares customer data with a third 
party partner originating from a country with human rights abuses. 

(10) She wants to know which laws/regulations on privacy and data protection are 
applicable to the context of her transaction. 

(11) She wants to know which law/regulation on privacy and data protection has 
precedence for her transaction. 

(12) She wants to know that she does not inadvertently provide information on a Web 
form at this site that goes against her stated privacy preferences. For instance, 
she has a preference not to give out her age, but she provides a site with her 
birth-date and weight in the context of buying prescription medication. 

(13) She would like to negotiate a quick electronic contract with CanPharma.ca, in 
which the company becomes obligated to destroy her data if it and its assets are 
sold to another company. 

(14) When she returns to CanPharma.ca, she would like to review her information 
and contracts. 

(15) She wants to review her privacy beliefs for this site when she returns to Can- 
Pharma.ca. 

(16) She wants to maintain online privacy beliefs for particular sectors. 

Analyzing these requirements, we see that current P3P agents will support only the
first three requirements on this list. The rest of these requirements can be supported by
our private data management architecture. Requirement 4 can be satisfied by an extension
to P3P with a <SAFEGUARDS> tag and accompanying extension of the P3P
agent’s matching algorithm to do a SAFEGUARDS comparison. Satisfying requirements
5-9 is the focus of the cooperating Web services introduced in this paper. Data
support in a Web privacy ontology combined with client-side data support is envisaged
to support requirements 10-11. Data support on the client-side to satisfy requirements
12-16 is provided in the client-side architecture briefly described in the next
section. However, agents to implement negotiation, contracts, and client-side monitors
for privacy are the subject of another report (He 2004). To be clear, in this paper we focus
solely on motivating, designing, and implementing Web services that support private
data management and satisfy requirements 5-9.

2.1 User Interface for Input of Preferences 

Interface design is not a central issue for this paper. Yet it will be extremely important 
to the viability of any e-privacy or private data management application. Unless we 
have a breakthrough in UI design that is widely adopted in the coming year, it appears 
that form-based user interfaces and Y/N controls such as checkboxes or radio buttons 
are the most familiar, convenient, and user-friendly means to receive input into a sys- 
tem. Thus for the purposes of this paper, where we would like the reader to visualize a 
user setting preferences, we opt to suggest useful extensions to a form with similar 
content to the Privacy Preference Settings form that AT&T Bird uses (see 
www.privacybird.com/tour/1_2_beta/privacypreferences.html):

We would add the following items to that form or to a subform that deals with third 
party-recipients: 

- Warn me about companies that share customer information with other compa- 
nies that do not have privacy statements 

- Warn me about companies that share with other companies whose practices vio- 
late my privacy, ethical, and/or social preferences 

- Warn me about a company that has a third party partner that is on my blocked 
list 

- Warn me about businesses, third party or otherwise, that are in jurisdictions with
no enforcement of the fair information practices

We acknowledge that input forms need to be short, and that P3P and P3P agents such as
Bird kept functionality to an important core in order to improve chances of widespread
adoption. Nonetheless, completeness around user privacy and private data handling
concerns should be explored.
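
How a client-side agent might turn three of these added preference items into warnings can be sketched as below. The sketch is hypothetical throughout: the `Partner` and `Preferences` records, their field names, and the warning texts are assumptions, the partner facts are taken as given rather than fetched from any Web service, and the item on partners whose practices violate the user's preferences is left out because it needs a full policy comparison.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Partner:
    """Facts about one third-party partner of a site (assumed to be known)."""
    name: str
    has_privacy_statement: bool
    jurisdiction_enforces_fair_practices: bool


@dataclass
class Preferences:
    """User settings mirroring three of the proposed checkboxes."""
    blocked: Set[str] = field(default_factory=set)
    warn_no_statement: bool = True
    warn_blocked: bool = True
    warn_no_enforcement: bool = True


def warnings_for(partners: List[Partner], prefs: Preferences) -> List[str]:
    """Return one warning string per preference that a partner triggers."""
    found = []
    for p in partners:
        if prefs.warn_no_statement and not p.has_privacy_statement:
            found.append(f"{p.name}: no privacy statement")
        if prefs.warn_blocked and p.name in prefs.blocked:
            found.append(f"{p.name}: on blocked list")
        if prefs.warn_no_enforcement and not p.jurisdiction_enforces_fair_practices:
            found.append(f"{p.name}: jurisdiction without enforcement")
    return found
```

For the scenario of Section 2, a user who blocks CheatersInc and UnGreenCompany would be warned whenever a pharmacy lists either of them as a partner.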



3 Architecture and Web-Services for Privacy 

Empirical results obtained from structural equation modeling analysis of user data
(Jutla et al. 2004) show that the adoption of user intervention (UIV) tools such as P3P-based
agents, encryption, cookie cutters, pseudonymizers, and anonymizers increases
users’ trusting beliefs in e-business and in Internet-based trust. These and other research
results motivate the creation of client-side private data management architectures
that are based on the industry recommended P3P protocol, and that are open to
other user intervention tool add-ons. In (Bodorik and Jutla 2003, Jutla and Bodorik
2004a), a client-side privacy architecture based on the P3P platform is proposed and
elaborated. The architecture fosters user’s perception of control and thus increases
user’s trusting intentions to conduct e-business.
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Fig. 3.1. Private Data Management Architecture. 



We extend the original e-privacy architecture with the Web services proposed in 
this paper, thereby moving the architecture to the realm of private data management. 
A much simplified diagram depicting the resulting architecture is shown in Figure 3.1. 
The figure shows two key architectural components: a client-side component and a 
Web-side component (the regulatory web privacy ontology). The architecture includes 
internal and external agents and their interaction and access to repositories containing 
private data and privacy control information. For simplicity, the figure represents a 
number of internal agents by one agent icon and, similarly, a number of repositories 
by one repository icon. The repositories icon generically represents a number of re- 
positories to store private data, preferences, profile, contracts, service-site data, his- 
tory, specific regulations, and audit trails. The internal agent icon represents the fol- 
lowing agents and services: 

• Monitor agent that observes the user’s actions and reports them to the user’s personal
context manager. The motivation for this agent is that users are known to take actions
that contradict their stated, static privacy preferences (Spiekermann et al. 2001). The
architecture supports dynamic induction of a user’s privacy preferences from multiple
inputs, including the monitor feed.

• Arbitrator agent that negotiates, with a human-in-the-loop escalation approach,
privacy contract(s) with web sites.

• Personal context agent that maintains the context within which the user operates,
provides the user with information on the context and her actions, particularly when
her preferences and actions conflict, and seeks her guidance/instructions related to
preferences in context. The personal context agent has a preference induction engine
that maintains the user’s dynamically changing preferences within contexts.
We elaborate on contexts and the context agent in (Jutla and Bodorik 2004b).

• Regulatory agent that maintains user privacy preferences and information about
privacy regulations, guidelines, rules, and any user-pertinent privacy governance
information that guides the user privacy agents during user transactions with service
sites. This agent invokes the Web private data management services described in
this paper.

• Web services, shown in the figure, provide information about regulations that 
apply in different privacy regions/countries, information to be used by user 
agent(s) to adjust privacy preferences accordingly and by the user to take 
appropriate actions in terms of managing her private data collected by service 
sites. 

• The Web privacy ontology structure(s), shown in Figure 3.3 and presented in section 3.3,
support the client-side internal regulatory agent’s need for regulatory information over
private data. Ideally there should be a number of these ontologies: at least one per
country, and possibly one per legal domain that deals with private data.

3.1 Data Model for Client-Side Private Data Management System 

The data model for the private data management architecture, including data reposito- 
ries for the regulatory agent (which invokes cooperating Web services to maintain a 
coherent regulatory knowledge base for the user), is shown in Figure 3.2. The diagram 
shows relationships among interacting objects, and significant attributes of these key 
objects. We use this diagram merely to show what types of information the client-side 
architecture stores. Specifically, the Web services populate the Regulation and Regu- 
latory Belief stores. Information may be stored in RDF or OWL format in an ontology 
structure to promote storage of richer knowledge around user beliefs. The intent is to 
have semantic information kept on subjects such as jurisdiction, regulation, users’ 
regulatory beliefs, business, users’ trusting beliefs about a business, transaction con- 
text, user information, user role information, and user preferences per role informa- 
tion. 

Jurisdiction covers information on geographical, legal, and international jurisdiction, as
well as community and association guidelines. Regulation-related information includes
applicable laws and other types of regulations related to privacy per jurisdiction. A
user context is made up of a set of beliefs about the current situation, including user
preference beliefs about release of certain data, trusting beliefs in business (substitute
government, community, individual, or other stakeholder for business as appropriate),
regulatory beliefs, and beliefs around transactions. Transaction refers to e-commerce
related transactions such as browse, buy, register, and collaborate. The intent is for the
user context to be composed also of beliefs around each type of transaction. For example,
a user may believe that a surf transaction implies potential for his/her clickstream
data to be collected.
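
To make the kinds of stored information more concrete, the sketch below expresses a small part of the data model as plain Python records. It is only an illustration under our own assumptions: the class names, attributes, and sample values are hypothetical, and the sketch uses in-memory objects rather than the RDF or OWL storage discussed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Regulation:
    """A privacy-related law or guideline that applies in one jurisdiction."""
    name: str
    jurisdiction: str


@dataclass
class Belief:
    """One belief held in a user context."""
    kind: str        # "preference", "trusting", "regulatory", or "transaction"
    subject: str     # e.g. a data element, a business, a regulation, a transaction type
    statement: str


@dataclass
class UserContext:
    """A user context: a role, the current transaction, and a set of beliefs."""
    role: str
    transaction: str                      # browse, buy, register, collaborate, ...
    beliefs: List[Belief] = field(default_factory=list)

    def beliefs_of(self, kind: str) -> List[Belief]:
        """Return the beliefs of one kind, in insertion order."""
        return [b for b in self.beliefs if b.kind == kind]


# Placeholder values, not real regulations or businesses.
act = Regulation(name="Example Privacy Act", jurisdiction="Example jurisdiction")
context = UserContext(role="consumer", transaction="browse")
context.beliefs.append(
    Belief("transaction", "browse", "clickstream data may be collected"))
context.beliefs.append(
    Belief("regulatory", act.name, f"applies in {act.jurisdiction}"))
print([b.statement for b in context.beliefs_of("transaction")])
```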




Fig. 3.2. A Data Model for private data management architecture. 



3.2 Regulatory Privacy Ontology and Services 

We have proposed a model for a high-level Web privacy ontology, reported in (Jutla and
Bodorik 2004a). We have also implemented one layer of this privacy ontology (Jutla 
et al. 2004b). In this paper, by adding P3P tags to the concepts and their definitions in 
the regulatory privacy ontology, we facilitate a Web service to perform a three-way 
comparison among user preferences, business practices, and government regulations. 
This comparison could be useful to an Internet user in several ways. A comparison 
between the contents of P3P elements representing business privacy practices and 
those representing privacy law may result in highlighting to the user (1) omissions in 
the business’ P3P policy statements, or (2) concerns of mismatch of interpretation of 
privacy legislation. The P3P specification is not yet mature enough in terms of ele- 
ment definitions to cleanly handle many legal subtleties - hence a Web service can be 
useful to the user in flagging absence/presence, or ambiguity, of fair information prin- 
ciples regarding privacy as defined in law in the business’ practices expressed in P3P 
policies. 

Access to privacy ontologies that contain information expressed using P3P tags can help
users populate their preferences in an informed way. A P3P-agent comparison of user
privacy preferences and the corresponding concepts in a regulatory Web privacy ontology can
flag user inattention to details in the user’s preference ruleset. For instance, the user’s
preference rule may state that a data element may be retained by a web-site company
indefinitely, while an applicable law may limit the retention of private data to six years.
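
The retention example can be sketched as a small three-way check. The sketch below is a simplification under stated assumptions: retention periods are encoded as numbers of years (P3P's own RETENTION element uses categorical values, so this numeric encoding is purely illustrative), and the function name and the wording of the findings are hypothetical.

```python
import math

INDEFINITE = math.inf  # stands in for "retain indefinitely"


def compare_retention(user_pref_years, policy_years, law_limit_years):
    """Three-way comparison of retention periods, all expressed in years.

    policy_years may be None when the business policy makes no retention statement.
    Returns a list of findings to highlight to the user.
    """
    findings = []
    if user_pref_years > law_limit_years:
        findings.append("user preference is more permissive than the law")
    if policy_years is None:
        findings.append("business policy omits a retention statement")
    else:
        if policy_years > law_limit_years:
            findings.append("business policy exceeds the legal retention limit")
        if policy_years > user_pref_years:
            findings.append("business policy exceeds the user's preference")
    return findings


# The user allows indefinite retention, the site states 10 years, the law allows 6.
print(compare_retention(INDEFINITE, 10, 6))
```

The first finding corresponds to flagging user inattention in the preference ruleset; the others correspond to omissions or mismatches in the business policy.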

In this paper we use WSDL (W3C WSDL 2003) to describe the functionality of
each Web privacy ontology service, the sequence and cardinality of messages
sent/received by a service operation, and binding specifications to multiple protocols
(e.g. SOAP, HTTP, and MIME). WSDL files are defined for (a)
GetPrivateDataConcerns(context), (b) GetPrivacyLaws(country), (c)
GetFIPsInfo(principlename, country, privacylaw), where FIPs is the acronym for fair
information practices, (d) MatchPolicyAndLaws(SitePolicy, jurisdiction), (e)
PrivateDataSearchService(SearchCriteria), and so on.

3.3 Web-Side Architectural Components and Services 

The client-side Private Data Management (PDM) system includes Web services that
must exist to provide users with requisite and useful privacy knowledge in support
of their activities on the Web. Figure 3.3 illustrates our Web-services architectural
model. Service providers first register their services with the Universal Description,
Discovery, and Integration (UDDI) service, shown in the figure as step 1. The
regulatory agent finds/discovers a service through UDDI directory lookup, shown as
step 2, and binds/invokes a service (step 3) that may provide for composition of private
data management services to answer deceptively simple, yet complex, user
queries on private data handling, and receives summary results (step 8).

Furthermore, these Web-services require a privacy knowledge base supported by an 
ontology. More specifically, a number of ontologies, spanning various countries and 
regulations regarding handling of private data, will be required. We utilize a standard, 
generic UDDI directory to get access to these ontological resources. We assume the 
extension of domain-specific classification for UDDI, in this case domain-specific 
classification for private data handling. The regulatory agent accesses the functionality 
of the UDDI registry through invoking a set of public Web services interface methods. 
The additional interface Web methods to the UDDI registry may be (recall PDM is an 
acronym for private data management): 

a. GetProviderWSDL(providerID, aPDMService, nJurisdiction) where providerID
is an output parameter that contains a list of providers that can perform
aPDMService. The aPDMService input parameter represents a generic name for
a particular private data management service, and the nJurisdiction input
parameter refers to one or more jurisdictions. An example of aPDMService is
“GetViolationPenalties”, which refers to retrieving providers that can
identify legal penalties for private data handling violations in the specified
jurisdiction(s). In February 2004, the P3P committee added a useful
<JURISDICTION> extension to the <RECIPIENT> tag, which makes
aPDMService more easily feasible. We had previously identified the need for a
Jurisdiction P3P tag, thus illustrating how development of privacy/private data
Web services, and more importantly user requirements, can drive the maturity of
the P3P vocabulary and vice versa.

b. SetProviderWSDL(providerID, aPDMService, nJurisdiction, WSDL) adds or updates
the WSDL entry for the corresponding supplier, service, and jurisdiction.

c. QOSProvider( aPDMService, QOS_list) returns a list of suppliers of private data 
related services based on quality of service (QoS) input parameters such as dis- 
tance, reputation, reliability, availability, timeliness, and cost. The distance QoS 
parameter is perhaps the most interesting as it is intended to be the result of a 
distance function that measures what is the “closest” service to what is being re- 
quested. 
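
The three interface methods above can be sketched as an in-memory registry. This is a minimal illustration, not a UDDI client: the `PDMRegistry` class, the `set_provider_qos` helper, the provider IDs, and the URLs are hypothetical, the provider list is returned rather than passed back through an output parameter, and QoS values are assumed to be pre-normalized scores in [0, 1] where higher is better.

```python
class PDMRegistry:
    """Toy stand-in for a UDDI registry extended with PDM lookup methods."""

    def __init__(self):
        self._wsdl = {}   # (provider_id, service, jurisdiction) -> WSDL location
        self._qos = {}    # provider_id -> {QoS parameter name: score}

    def set_provider_wsdl(self, provider_id, service, jurisdiction, wsdl):
        """Add or update the WSDL entry for a provider, service, and jurisdiction."""
        self._wsdl[(provider_id, service, jurisdiction)] = wsdl

    def get_provider_wsdl(self, service, jurisdictions):
        """Return (provider_id, jurisdiction, wsdl) for every matching entry."""
        wanted = set(jurisdictions)
        return sorted(
            (pid, jur, wsdl)
            for (pid, svc, jur), wsdl in self._wsdl.items()
            if svc == service and jur in wanted
        )

    def set_provider_qos(self, provider_id, **scores):
        """Record QoS scores for a provider (assumed helper, not in the paper)."""
        self._qos[provider_id] = scores

    def qos_provider(self, service, qos_weights):
        """Rank providers of a service by a weighted sum of their QoS scores."""
        providers = {pid for (pid, svc, _jur) in self._wsdl if svc == service}

        def score(pid):
            scores = self._qos.get(pid, {})
            return sum(w * scores.get(name, 0.0) for name, w in qos_weights.items())

        # Highest score first; ties are broken by provider ID for determinism.
        return sorted(providers, key=lambda pid: (-score(pid), pid))


registry = PDMRegistry()
registry.set_provider_wsdl("prov-A", "GetViolationPenalties", "J1", "http://example.org/a.wsdl")
registry.set_provider_wsdl("prov-B", "GetViolationPenalties", "J1", "http://example.org/b.wsdl")
registry.set_provider_wsdl("prov-B", "GetViolationPenalties", "J2", "http://example.org/b-j2.wsdl")
registry.set_provider_qos("prov-A", reputation=0.9, cost=0.2)
registry.set_provider_qos("prov-B", reputation=0.6, cost=0.8)
print(registry.get_provider_wsdl("GetViolationPenalties", ["J2"]))
print(registry.qos_provider("GetViolationPenalties", {"reputation": 1.0}))
```

A weighted sum is only one possible ranking rule; the distance function mentioned for the distance parameter is left out of this sketch.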




Fig. 3.3. Architectural Model of Web Services for Private Data Management. 



4 Modeling Web Services for Private Data Management 

Consider a user P3P agent that reads in Business Technology Services (BTS) Inc.’s
privacy statement that BTS “may share consumer information with its strategic
partners”. Also suppose that the user sets a requirement in her preferences to find out who
the company’s partners are. The P3P agent communicates this requirement to the
regulatory agent, which could then issue the query “who are BTS’s partners” to an
external Web service named WhoArePartners (see Figure 4.1). The invocation of this
service is not left up to the P3P agent; rather, recall that one functional role of the
regulatory agent is to maintain client-side knowledge about the users’ trusting beliefs
for various companies.

For a more complex scenario, the regulatory agent may ask “who are BTS’s partners
and where are BTS’s partners based?” If the answers include that one partner is in
Japan, another in India, and a third in England, then the regulatory agent will invoke
privacy Web services to determine the level of privacy protection provided by each
jurisdiction’s laws. Other Web services that examine other laws that deal with the
handling of private data, such as consumer protection laws, can also be invoked.

We design a Web service to perform the three-way comparison of the user’s privacy
preferences, business privacy policies, and government laws relating to the handling of
private data.




Fig. 4.1. WhoArePartners Service. 




Fig. 4.2. Simultaneous matching of User Preferences and Partners’ Policies (only 3 partners
shown). 



We model example compositions of Web services for private data management using
simplified UML sequence diagrams for dynamic modeling. The three-way service, or
ComplexPDMService (Figure 4.4), is composed of two Web services, shown in Figures 4.1
and 4.3 respectively. The first individual Web service, WhoArePartners, finds, say,
three partnering companies. The simultaneous comparison of the user privacy
preference with each business partner’s P3P policy is designed as per Figure 4.2 and
executed on the client side. Three user P3P agents simultaneously compare the user’s
private data preferences with the three partners’ site policies. Another Web service for
comparing each partner’s P3P policy with jurisdictional privacy regulations or
guidelines for fair information practice is shown in Figure 4.3.

The ComplexPDMService, shown in Figure 4.4, combines findings from the other
two Web services and the P3P agents (Figures 4.1, 4.2, and 4.3) and returns the results to
the user regulatory agent, which then recommends an appropriate action to the user. The
implementation of these Web service designs is discussed in the next section.
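
The composition described in this section can be sketched in a few lines. The sketch runs everything in one process with stubbed data, whereas in the design the preference comparison runs on the client side and the other steps are remote Web services. All partner names, jurisdiction labels, policy fields, and rule values below are made-up placeholders, and each policy is reduced to a single yes/no practice for brevity.

```python
# Stub data: placeholders only, not real companies, policies, or laws.
PARTNERS = {"BTS": ["Partner1", "Partner2", "Partner3"]}
POLICIES = {
    "Partner1": {"jurisdiction": "J-A", "shares_with_third_parties": False},
    "Partner2": {"jurisdiction": "J-B", "shares_with_third_parties": True},
    "Partner3": {"jurisdiction": "J-C", "shares_with_third_parties": True},
}
LAWS = {
    "J-A": {"third_party_sharing_allowed": False},
    "J-B": {"third_party_sharing_allowed": False},
    "J-C": {"third_party_sharing_allowed": True},
}


def who_are_partners(company):
    """Stand-in for the WhoArePartners service (Figure 4.1)."""
    return PARTNERS.get(company, [])


def meets_user_preferences(user_prefs, policy):
    """Stand-in for a client-side P3P agent comparison (Figure 4.2)."""
    return user_prefs["allow_third_party_sharing"] or not policy["shares_with_third_parties"]


def meets_jurisdiction_law(policy, law):
    """Stand-in for the policy-versus-law matching service (Figure 4.3)."""
    return law["third_party_sharing_allowed"] or not policy["shares_with_third_parties"]


def complex_pdm_service(company, user_prefs):
    """Combine the individual findings into one report per partner."""
    report = {}
    for partner in who_are_partners(company):
        policy = POLICIES[partner]
        law = LAWS[policy["jurisdiction"]]
        report[partner] = {
            "jurisdiction": policy["jurisdiction"],
            "meets_user_preferences": meets_user_preferences(user_prefs, policy),
            "meets_jurisdiction_law": meets_jurisdiction_law(policy, law),
        }
    return report


report = complex_pdm_service("BTS", {"allow_third_party_sharing": False})
for partner, findings in report.items():
    print(partner, findings)
```

The regulatory agent would then turn such a report into a recommendation, for example by warning about any partner that fails either check.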




Fig. 4.3. Matching Business Policies and Jurisdictional Principles for Privacy.



5 Implementation

Thus far we have implemented the infrastructure for the Web services described 
above. The P3P agents have been implemented, and user preferences and contextual 
information have been implemented in an OWL structure on the client-side. While 
P3P agents use the preference knowledge in the client-side structure, other agents can 
access and modify this OWL structure for automating contextual decision-making 
around e-privacy. This contextual decision-making is the reason we chose OWL to 
store preferences on the client-side, rather than use APPEL. The Web Service in Fig- 
ure 4.3 retrieves regulatory information about a jurisdiction’s fair information prac- 
tices from an experimental regulatory privacy ontology on the Web (we created and 
stored this ontology in the Sesame database in the Netherlands) which returns results 
that are used for comparison with the business’s policies. 

Currently we have the Web services communication for the WhoArePartners Web
service (Figure 4.1) in place but have not implemented its full functionality. We also
note that the WhoArePartners service is particularly challenging, as we may need to handle
cascading sets of partners. That is, each third-party partner may in turn have its own
third-party partners. Governance in a firm or jurisdiction must be able to set limits as to
whether to cascade, or to assume that the cascade level is one, as in the case when the
user is assured that the original firm has a contract with its third-party partner not to
further share any data sent to it with other organizations. Alternatively, it would be
functionally, but not organizationally, easier for P3P to be extended to encourage
companies to list third-party partners through the addition of a PARTNERS tag. As most
companies are not compelled to disclose partners, we are uncertain whether there could
be consensus towards such a P3P extension. The case for it will depend on
how proactive businesses become in contributing to online trust.
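
One way to picture the cascade limit is as a bounded breadth-first traversal over partner relationships. The sketch below is an assumption-laden illustration rather than our implementation: `cascade_partners`, its `max_level` parameter, and the sample partner graph are hypothetical, and the `lookup` callable stands in for a call to a WhoArePartners-style service.

```python
from collections import deque


def cascade_partners(company, lookup, max_level=1):
    """Collect partners of `company` up to `max_level` cascade levels.

    `lookup` maps a company name to the list of its direct partners.
    Returns {partner: level at which it was first reached}.
    """
    found = {}
    queue = deque([(company, 0)])
    while queue:
        current, level = queue.popleft()
        if level == max_level:
            continue  # governance limit reached; do not cascade further
        for partner in lookup(current):
            # Skip the starting company and anything already seen, so cycles terminate.
            if partner != company and partner not in found:
                found[partner] = level + 1
                queue.append((partner, level + 1))
    return found


# Made-up partner graph containing a cycle back to the starting company.
graph = {"BTS": ["P1", "P2"], "P1": ["P3"], "P3": ["BTS", "P4"]}


def lookup(name):
    return graph.get(name, [])


print(cascade_partners("BTS", lookup, max_level=1))
print(cascade_partners("BTS", lookup, max_level=3))
```

With `max_level=1` the traversal matches the case where the original firm's contract forbids further sharing; larger limits expose second-level and third-level recipients.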

The service in Figure 4.4 is partially implemented. We do not have automatic composition
of services in the ComplexPDMService as envisioned. Rather, we have implemented
a bouquet service in which we fixed the individual services that make up the aggregate.
Another incremental version will add a reasonable algorithm based on fuzzy matching
and QoS parameters for selecting the right aggregation of services. We will also be
investigating the Web Services Choreography Interface, and OWL-S, the Web Ontology
Language for Services, for a more sophisticated version of our Web service shown in
Figure 4.4.

We chose to implement our Web services using components of two of the major
Java Web service tools: Java Web Services Developer Pack (JWSDP) version 1.3
from Sun Microsystems, and Web Service Development Kit (WSDK) version 5.1
from IBM. From JWSDP we used the Java API for XML-based RPC (JAX-RPC) v1.1, the Ant
build tool 1.5.4, and the Apache Tomcat v5 development container. From
WSDK we used UDDI4J, the IBM WebSphere UDDI v2.0 registry, and Eclipse plug-ins
to enable browsing for Web services in UDDI registries, to create Web services from
WSDL definitions, and to publish and unpublish Web services to a UDDI registry.
JWSDP 1.3 contains the latest versions of Java and XML technologies for building
reliable and secure Web services. It also uses the well-known open industry standards
Ant and Tomcat as the compile-time and run-time environment. We used IBM’s UDDI
2.0 registry as it exactly meets the UDDI 2.0 specification.

We publish a tModel and service to IBM’s private UDDI directory through a
graphical interface. Figure 5.1 shows the results returned from searching for all
registered businesses with names starting with “Eprivacy” in WebSphere’s private UDDI
directory. We also created a public UDDI version.

The Web services we created are a “Complex PDM Service” (Figure 4.4, with
modifications as described), multiple “Privacy Ontology Query Services” (partly
Figure 4.3), and one “WhoArePartners Service” (Figure 4.1). All are implemented as Java
XML RPC services that accept SOAP XML requests. Both the internal regulatory agent
and the composite Web service are implemented as JAX-RPC clients. The JAX-RPC
client is implemented using a dynamic proxy model, which creates the proxy from the
WSDL file at run time. Depending on the regulatory query, the complex Web service
spawns multiple threads to invoke multiple individual ontology query Web
services in parallel. The complex Web service then wraps up the answers and passes
the result to the regulatory agent.
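
The fan-out and wrap-up pattern described above can be sketched as follows. Note that this swaps in Python's `concurrent.futures.ThreadPoolExecutor` for the Java threads and JAX-RPC clients of our implementation, and that `query_ontology` is a hypothetical local stub returning placeholder text rather than a real ontology query.

```python
from concurrent.futures import ThreadPoolExecutor


def query_ontology(jurisdiction):
    """Stand-in for one Privacy Ontology Query Service call (no network access)."""
    return {
        "jurisdiction": jurisdiction,
        "principles": [f"placeholder principle for {jurisdiction}"],
    }


def complex_service(jurisdictions, query=query_ontology):
    """Invoke one ontology query per jurisdiction in parallel, then merge the answers."""
    if not jurisdictions:
        return {}
    with ThreadPoolExecutor(max_workers=len(jurisdictions)) as pool:
        # pool.map returns results in the order of the inputs.
        answers = list(pool.map(query, jurisdictions))
    return {a["jurisdiction"]: a["principles"] for a in answers}


print(complex_service(["J1", "J2", "J3"]))
```

Because the individual queries are independent and dominated by network latency, issuing them concurrently is where a parallel version can save time over a serial one.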



Fig. 5.1. Services registered in Private UDDI Directory.

Our implementation of the privacy ontology sits on the Sesame server. Sesame is
an open source RDF database that provides an interface for both local and remote
access. In our “Privacy Ontology Query Service”, we import the Sesame API 1.0 into
our Web service program. The “Privacy Ontology Query Service” makes HTTP
connections and then passes the RQL query strings to the remote Sesame server. The
Sesame server then responds with a result table to the requester.

We created four versions of the Web services infrastructure for the
ComplexPDMService. The first version is illustrated in Figure 3.3, where all private data
management services are found in a public UDDI directory. The second version was
created to test the overhead difference between parallel and serial connections to
ontology service providers. The third version is the same as the second
version except that the individual services that make up a complex private data
management service are found in a private UDDI directory, while the ComplexPDMService
provider is found through public UDDI lookup as before; this is shown in Figure 5.2.
Version 4 uses only private UDDI directory lookups.

Early measurements, at various times of day and week, show that the longest delay
is on access to IBM’s public UDDI directory. Access to the public UDDI
directory was on average 25 times slower than access to the privacy ontology stored in
Sesame. Version 2 showed that parallelizing ontology searches sped up service time
by 16%, or approximately 5 seconds. Version 3 showed that access to a private
UDDI directory is 40 times faster than access to IBM’s public UDDI directory. In
version 3, the round-trip delay for results to be returned to the regulatory agent was
about half that of version 1 (14 seconds vs. 27 seconds). Version 4 clocked the best
round-trip delay (from regulatory service initiation of the first UDDI lookup to return
of final results from the ComplexPDMService) at 7 seconds. These times are acceptable
for informational services not on the critical path of a user transaction. However, the
times are expected to improve once platforms are optimized to support Web services.




Fig. 5.2. Public UDDI Directory to find the ComplexPDMService Provider, and a Private
UDDI to find Ontology and WhoArePartners Service Providers.



6 Related Work 

The Platform for Privacy Preferences (P3P) was published by W3C in 2002 (P3P
2004) and, regardless of some shortcomings, it is the only contender on which to base
privacy mechanisms. The latest working draft of P3P version 1.1 was released in April
2004. A number of tools have been developed for Web-masters to post privacy policies
in P3P format. The Microsoft IE6 and Netscape Navigator 7 Web browsers provide
basic P3P functionality. AT&T provides a P3P agent called Privacy Bird as an add-on
to the IE6 browser. The add-on checks for P3P policies for all content on a page visited by
the user, compares them to the user preferences, and reports on the match using a
traffic-light metaphor in its interface. A study of users, mainly over 50 years old, reports
that the Privacy Bird is a useful agent (Cranor 2002). User privacy agents simplify
the task of examining the privacy policies posted by Web sites and determining
whether or not they are acceptable to the users/clients, a task that is cumbersome
and disliked by users (Cranor 2002).

In contrast to client-side P3P agents, Agrawal et al. (2003) propose an efficient
server-based implementation of the comparison of user preferences and a
site’s P3P-enabled privacy policy. Although the IBM researchers’ ideas have many
advantages, including convenience and performance, our main concern is that
users’ preferences must be uploaded to the server side, thus possibly enabling further
user profiling. More importantly, the server-side scheme is not suitable for
e-commerce at this stage, where initial online trust formation is still an issue,
particularly among on-line buyers and small and medium-sized enterprises, which comprise
99% of businesses in various countries. The scheme in (Agrawal et al. 2003)
may be more suitable for large businesses, or those businesses in trusted sectors such
as the financial sector, or may gain more user acceptance after e-commerce matures
considerably over the next decade.

The work that most resembles the links between the user’s data and business in our
architecture is the iManager architecture (Jendricke and Markotten 2000), which contains
databases for personal data, personas, URLs, and rules. The iManager does not support
significant stakeholder influence or social, economic, and regulatory feeds, and it
does not describe how the control of the personal identity is affected by external
entities/stakeholders. To the best of our knowledge, usability results are not yet
available for the iManager.

Several proposals exist for trusted third party (TTP) storage of user profiles and 
preferences. A proposal to access a user profile, anywhere and anytime, through any 
device, is described in (Cingil 2002). The user is required to do a browser-login to the
TTP and her surfing behavior, via click-stream, is monitored and captured locally and 
used to update the user’s profile. The major problem is the centralized and authorita- 
tive approach that does not allow the user control over the collected information. 
Many users prefer their profiles to be fragmented across many devices since fragmen- 
tation provides a form of privacy protection on its own, similarly to un-synthesized 
databases. 

Turner et al. (2003) propose a privacy framework for Web services. This work dif- 
fers considerably from ours, in that its intent is to organize mechanisms to minimize 
private data being handed to Web services, and to provide privacy in Web services, in 
general. In contrast, our Web services are informational, dedicated to providing 
information pertinent to the handling of private data. Our Web services avoid handing 
over the user’s private data. However, the framework of Turner et al. (2003) can
complement our Web services, or we can apply the framework to our Web services, if we
should have to pay or register for access to the various ontologies and WhoArePartners
services, for example. It is our hope that governments will provide free read access to
future privacy ontologies. We also suggest that the P3P vocabulary would benefit users if
a PARTNERS tag were added to list firms’ third-party partners. Then P3P agents
could replace the WhoArePartners service.

Our work builds on the client-centric vs. server-centric concept, as evidence, both
quantitative (Jutla et al. 2004, Novak et al. 2000) and qualitative (Aggarwal et al.
2004, Bodorik and Jutla 2003, Jutla 2003, Jutla and Bodorik 2004a), indicates that
increasing users’ control over online private data contributes to an increase in their
trusting beliefs and intentions to engage in e-business behaviours with firms. We propose
informational Web services for private data management, including for the purposes of
online privacy. We show how P3P agents can interact with these Web services to
provide the user with more information to make informed decisions about
whether he/she can trust the site with which he/she is dealing.

The Resource Centre on P3P of JRC (JRCarchitecture 2004) has a basic privacy ar- 
chitecture that does not include access to Web-services or cooperation with Trusted 
Third Parties (TTP) as yet. However, it is an impressive platform for extended re- 
search on e-privacy that already has a demonstration site and various downloadable 
tools. Furthermore, an ontology for data protection is in the planning stage. It is a 
substantial and long-term undertaking that involves education and participation of the 
various stake-holders in arriving at the standard ontology (JRContology 2004). 

In a two-page position paper, Kim (2002) argues that privacy should be built into the
Semantic Web and stresses the need for a privacy ontology. This is also one of the
conclusions in (Rezgui 2003). We have proposed a high-level model for a privacy ontology
in Jutla and Bodorik (2004a), and implemented an ontology fragment as proof of
concept (Xu 2004). The Web services described in this paper access this ontology,
stored on Sesame, an RDF database created in the highly regarded European
OntoKnowledge project.

7 Summary and Conclusions 

In this paper, we first motivate the need for informational Web services around the
handling of private data by describing a user scenario and identifying the user
requirements. Then, we compose combinations of Web services to form the backbone of
sophisticated querying of distributed ontologies and knowledge bases, in order to
provide the user with reliable advice around private data exposure. There are possibly
many laws with which a business must comply when handling customer personal
information, such as consumer protection laws and sector-dependent laws, for example the
Alberta Health Information Protection Act in the province of Alberta, Canada, or HIPAA
in the US. Labor laws govern employee personal information. Depending on the
circumstances of a dispute, one law can have precedence over another. Clearly, several
regulatory Web ontologies will hold semantic information to support privacy Web
services. Knowledge bases for privacy need not only be about laws and acts regarding the
handling of private data but may also provide users with industry standards and cultural
guidelines around privacy that may affect the transacting parties. We propose a novel
three-way comparison using P3P tags, and suggest that these tags be embedded into
the concepts and definitions within regulatory ontologies containing knowledge about
the handling of private data. We designed and partially implemented a Web service to
achieve the three-way P3P comparison proposed in this paper.

We illustrate how private data management can be implemented and executed
through the interaction of Web services, P3P agents, and architectural component
objects in our system. Our client-side private data management system is open and
already inclusive of today’s P3P agents. Not unexpectedly, users are saying that they
would like to have integrated PET tools (Bayers 2003). User intervention add-ons such
as cookie crushers and anonymizers can be hooked into a future version of our system.
Encryption is assumed to be included in any privacy architecture. We need
security for the Web services to prevent public inferences based on what sites we visit.
However, much of our Web services’ input and output contains information that is publicly
available. An advantage of our client-side private data management architecture is that
user preferences are never sent out over the net.

Fledgling regulatory ontologies are becoming a reality since semantic web and
ontological engineering technologies are available and maturing, and international
groups are interested in their development. Once these ontologies are in place, the Web
services proposed in this paper will produce more sophisticated and useful results.
These Web services are applicable and may form an essential part of the private data
management tool kit of the future global Web user.



Abstract. Researchers have recently begun to develop and investigate policy languages
to describe trust and security requirements on the Semantic Web. Such
policies will be one component of a run-time system that can negotiate to estab- 
lish trust on the Semantic Web. In this paper, we show how to express different 
kinds of access control policies and control their use at run time using PeerTrust, 
a new approach to trust establishment. We show how to use distributed logic 
programs as the basis for PeerTrust’s simple yet expressive policy and trust ne- 
gotiation language, built upon the rule layer of the Semantic Web layer cake. 
We describe the PeerTrust language based upon distributed logic programs, and 
compare it to other approaches to implementing policies and trust negotiation. 
Through examples, we show how PeerTrust can be used to support delegation, 
policy protection and negotiation strategies in the ELENA distributed eLearning 
environment. Finally, we discuss related work and identify areas for further re- 
search. 

Keywords: Automated Trust Negotiation, Peer-to-Peer, Semantic Web, Policy 
Languages 



1 Introduction 

As peer-to-peer architectures start to move into use for applications based on the Seman- 
tic Web, they must address the issue of access control for sensitive resources provided 
by peers in the network [9, 19], such as services, documents, roles, and capabilities. 
For example, in the Edutella infrastructure [15, 14, 16], each peer manages distributed 
resources described by RDF metadata, and interfaces to the Edutella network using a 
Datalog-based query language. The early Edutella testbeds focussed on providing dis- 
tributed learning repositories in an environment where all resources are freely available; 
the main research focus was efficient searching for course-related information using 
appropriate queries over the metadata available for that information. More recently, 
however, the Edutella infrastructure has been deployed in the context of the EU/IST 
ELENA project [18], whose participants include e-learning and e-training companies, 
learning technology providers, and several universities and research institutes (see also 
http://www.elena-project.org/). To meet the needs for access control in this peer-to-peer 
network that connects commercial e-learning providers and learning management sys- 
tems, Edutella must also support access control policies that describe who is allowed to 
access each document and service. 

W. Jonker and M. Petkovic (Eds.): SDM 2004, LNCS 3178, pp. 118-132, 2004. 

For example, suppose that E-Learn Associates manages a Spanish course in the 
peer-to-peer network, and Alice wishes to access the course. If the course is accessible 
free of charge to all police officers who live in and work for the state of California, Alice 
can show E-Learn her digital police badge to prove that she is a state police officer, as 
well as her California driver’s license, and subsequently can gain access to the course 
at no charge. 

However, Alice may not feel comfortable showing her police badge to just anyone; 
she knows that there are Web sites on the west coast that publish the names, home 
addresses, and home phone numbers of police officers. We can view her police badge 
as an item on the Semantic Web, protected by its own release policy. For example, 
Alice may only be willing to show her badge to companies that belong to the Better 
Business Bureau of the Internet. But with the introduction of this additional policy, 
access control is no longer the one-shot, unilateral affair that one finds in traditional 
distributed systems or in recent proposals for access control and information release 
on the Semantic Web [9, 19]: in order to see an appropriate subset of Alice’s digital 
credentials, E-Learn will have to show that it satisfies the release policies for each of 
them; and in the process of demonstrating that it satisfies those policies, it may have 
to disclose additional credentials of its own, but only after Alice demonstrates that she 
satisfies the release policies for each of them; and so on. Thus the use of policies and 
digital credentials as a basis for access control on the Semantic Web raises a number of 
challenging run-time issues: 

- How can Alice and E-Learn find out about each other’s relevant access control and 
release policies, so that they can prove that they satisfy them? 

- Given that there may be many ways that Alice can prove that she satisfies a particu- 
lar policy of E-Learn’s (by disclosing different subsets of her credentials), how can 
she decide which subset to disclose? 

- Often Alice may not have in her possession all the credentials she needs to satisfy 
one of E-Learn’s policies. For example, E-Learn may offer a discounted price for 
its French course if Alice can demonstrate that she is a student at an accredited 
university. Alice probably has her student ID in hand, but how can she automatically 
collect the necessary credentials to show that her university is accredited? 

- Traditional distributed systems security solutions (e.g., Kerberos) are centralized, 
which runs counter to the autonomous, peer-to-peer nature of the Semantic Web. 
How can we meet all the above goals without resorting to a centralized approach, 
while still guaranteeing individual autonomy to the extent possible and simultane- 
ously guaranteeing that Alice and E-Learn will be able to establish trust - i.e., that 
Alice will be able to access E-Learn’s courses - if at all possible? 

In this paper, we build upon the previous work on policy-based access control and 
release for the Semantic Web by showing how to use automated trust negotiation to 
answer these questions, as embodied in the PeerTrust approach to access control and 
information release. We start by introducing the concepts behind trust negotiation in 
section 2. We then introduce distributed logic programs to express and implement trust 
negotiation in a distributed environment in section 3, and discuss PeerTrust’s trust 
negotiation using distributed logic programs in detail in section 4. We discuss related work 
in section 5 and conclude with a brief look at further research issues. 

2 Trust Negotiation 

In traditional distributed environments, service providers and requesters are usually 
known to each other. Often shared information in the environment tells which parties 
can provide what kind of services and which parties are entitled to make use of those 
services. Thus, trust between parties is a straightforward matter. Even if on some occa- 
sions there is a trust issue, as in traditional client-server systems, the question is whether 
the server should trust the client, and not vice versa. In this case, trust establishment is 
often handled by uni-directional access control methods, such as having the client log 
in as a pre-registered user. 

In contrast, the Semantic Web provides an environment where parties may make 
connections and interact without being previously known to each other. In many cases, 
before any meaningful interaction starts, a certain level of trust must be established from 
scratch. Generally, trust is established through exchange of information between the two 
parties. Since neither party is known to the other, this trust establishment process should 
be bi-directional: both parties may have sensitive information that they are reluctant to 
disclose until the other party has proved to be trustworthy at a certain level. As there 
are more service providers emerging on the Web every day, and people are performing 
more sensitive transactions (for example, financial and health services) via the Internet, 
this need for building mutual trust will become more common. 

In the PeerTrust approach to automated trust establishment, trust is established grad- 
ually by disclosing credentials and requests for credentials, an iterative process known 
as trust negotiation. This differs from traditional identity-based access control and re- 
lease systems mainly in the following aspects: 

1. Trust between two strangers is established based on parties’ properties, which are 
proven through disclosure of digital credentials. 

2. Every party can define access control and release policies (policies, for short) to 
control outsiders’ access to their sensitive resources. These resources can include 
services accessible over the Internet, documents and other data, roles in role-based 
access control systems, credentials, policies, and capabilities in capability-based 
systems. 

3. In the approaches to trust negotiation developed so far, two parties establish trust 
directly without involving trusted third parties, other than credential issuers. Since 
both parties have policies, trust negotiation is appropriate for deployment in a peer- 
to-peer architecture, where a client and server are treated equally. Instead of a one- 
shot authorization and authentication, trust is established incrementally through a 
sequence of bilateral credential disclosures. 

A trust negotiation is triggered when one party requests to access a resource owned 
by another party. The goal of a trust negotiation is to find a sequence of credentials 
$(C_1, \ldots, C_k, R)$, where $R$ is the resource to which access was originally requested, 
such that when credential $C_i$ is disclosed, its policy has been satisfied by credentials 
disclosed earlier in the sequence - or to determine that no such credential disclosure 
sequence exists. (For uniformity of terminology, we will say that R is disclosed when 
E-Learn grants Alice access to R.) 
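
The definition above can be illustrated with a small Python sketch that searches for a safe disclosure sequence using a simple eager strategy: in each round, every item whose policy is already satisfied by the other party's disclosures is released. This is a minimal sketch under stated assumptions, not PeerTrust's negotiation algorithm. The function name `negotiate`, the modelling of a policy as a predicate over disclosed item names, and the item names (including a hypothetical `bbb_membership` credential held by E-Learn) are introduced here for illustration only.

```python
from typing import Callable, Dict, List, Optional, Set

# Assumption: a policy is a predicate over the names of the items that the
# OTHER party has disclosed so far.
Policy = Callable[[Set[str]], bool]


def negotiate(policies: Dict[str, Policy],
              owner: Dict[str, str],
              resource: str) -> Optional[List[str]]:
    """Return a safe disclosure sequence ending in `resource`, or None."""
    disclosed: Dict[str, Set[str]] = {party: set() for party in owner.values()}
    sequence: List[str] = []
    progress = True
    while progress:
        progress = False
        for item, policy in policies.items():
            if item in sequence:
                continue
            # Everything disclosed so far by parties other than the item's owner.
            seen: Set[str] = set()
            for party, items in disclosed.items():
                if party != owner[item]:
                    seen |= items
            if policy(seen):
                disclosed[owner[item]].add(item)
                sequence.append(item)
                if item == resource:
                    return sequence
                progress = True
    return None  # no safe sequence reaches the resource


# Hypothetical encoding of the Alice / E-Learn example.
owner = {
    "spanish_course": "E-Learn",
    "bbb_membership": "E-Learn",
    "police_badge": "Alice",
    "drivers_license": "Alice",
}
policies: Dict[str, Policy] = {
    "spanish_course": lambda seen: {"police_badge", "drivers_license"} <= seen,
    "bbb_membership": lambda seen: True,
    "police_badge": lambda seen: "bbb_membership" in seen,
    "drivers_license": lambda seen: True,
}

print(negotiate(policies, owner, "spanish_course"))
```

The eager strategy may disclose more credentials than strictly necessary; a more cautious strategy would release only those items that are relevant to the requested resource.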

In practice, trust negotiation is conducted by security agents who interact with each 
other on behalf of users. A user only needs to specify policies for credentials and other 
resources. The actual trust negotiation process is fully automated and transparent to 
users. Further, the above example used objective criteria for determining whether to 
allow the requested access. More subjective criteria, such as ratings from a local or 
remote reputation monitoring service, can also be included in a policy. 

In the remainder of this paper we will show how to specify and apply policies and 
trust negotiation using distributed logic programs, building on the rule layer of the Se- 
mantic Web. Before we delve into details, though, let us highlight two general criteria 
for trust negotiation languages as well as two important features already mentioned 
briefly above. A more detailed discussion can be found in [17]. 

Well-defined semantics. Two parties must be able to agree on whether a particular set 
of credentials in a particular environment satisfies a policy. To enable this agreement, a 
policy language needs a clear, well-understood semantics. 

Expression of complex conditions. A policy language for use in trust negotiation 
needs the expressive power of a simple query language, such as relational algebra plus 
transitive closure. Such a language allows one to restrict attribute values (e.g., age must 
be over 21) and relate values occurring in different credentials (e.g., the issuer of the 
student ID must be a university that ABET has accredited). 
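
To make the two kinds of condition concrete, the following Python sketch checks a policy that both restricts an attribute value and relates values across credentials. It is only an illustration: the `Credential` record, its field names, the credential kinds, and the decision to combine the two example conditions into one policy are assumptions made here, and the sketch does not cover transitive closure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Credential:
    kind: str                      # e.g. "student_id", "accreditation"
    issuer: str
    subject: str
    attributes: Dict[str, int] = field(default_factory=dict)


def satisfies_policy(creds: List[Credential], subject: str) -> bool:
    # Selection: restrict an attribute value (age must be over 21).
    over_21 = any(
        c.kind == "id_card" and c.subject == subject
        and c.attributes.get("age", 0) > 21
        for c in creds
    )
    # Join: the issuer of the student ID must be the subject of an
    # accreditation credential issued by ABET.
    accredited = {c.subject for c in creds
                  if c.kind == "accreditation" and c.issuer == "ABET"}
    student_at_accredited = any(
        c.kind == "student_id" and c.subject == subject and c.issuer in accredited
        for c in creds
    )
    return over_21 and student_at_accredited


creds = [
    Credential("id_card", "DMV", "Bob", {"age": 24}),
    Credential("student_id", "UIUC", "Bob"),
    Credential("accreditation", "ABET", "UIUC"),
]
print(satisfies_policy(creds, "Bob"))      # all three credentials present
print(satisfies_policy(creds[:2], "Bob"))  # accreditation credential missing
```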

Sensitive policies. The information in a policy can reveal a lot about the resource that 
it protects. For example, who is allowed to see Alice’s medical record - her parole 
officer? Her psychiatrist or social worker? Because policies can contain sensitive in- 
formation, and because they may be shown to outsiders, they need to be protected like 
any other shared resource. Previous work on trust negotiation has looked at a variety of 
ways of protecting the information in policies. In this paper, we will use the protection 
scheme introduced in UniPro [21], which gives (opaque) names to policies and allows 
any named policy $P_1$ to have its own policy $P_2$, meaning that the contents of $P_1$ can 
only be disclosed to parties who have shown that they satisfy $P_2$. To give flexibility 
in assigning different levels of protection to different aspects of a policy, UniPro also 
allows the definition of a policy P to refer to other policy definitions by name. 
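
The idea of a named policy that is itself guarded by another policy can be sketched in Python as follows. This is not UniPro's actual syntax or implementation; the `NamedPolicy` record, the `request_definition` function, and the credential names are hypothetical and serve only to show that a requester who fails the guarding policy learns nothing beyond the opaque name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class NamedPolicy:
    definition: str                        # human-readable contents of the policy
    check: Callable[[Set[str]], bool]      # is the policy satisfied by these credentials?
    protected_by: Optional[str] = None     # name of the policy guarding the definition


def request_definition(policies: Dict[str, NamedPolicy],
                       name: str,
                       shown: Set[str]) -> Optional[str]:
    """Disclose the contents of policy `name` only if its own policy is satisfied."""
    policy = policies[name]
    guard = policy.protected_by
    if guard is None or policies[guard].check(shown):
        return policy.definition
    return None  # the requester learns only the opaque name


policies = {
    "P1": NamedPolicy(
        definition="requester is the patient's psychiatrist",
        check=lambda shown: "psychiatrist_license" in shown,
        protected_by="P2",
    ),
    "P2": NamedPolicy(
        definition="requester is hospital staff",
        check=lambda shown: "hospital_staff_id" in shown,
    ),
}

print(request_definition(policies, "P1", {"hospital_staff_id"}))
print(request_definition(policies, "P1", set()))
```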

Delegation. Trust negotiation research has also addressed the issue of delegation of 
authority. For example, rather than issuing student IDs directly, a university may dele- 
gate that authority to its registrar. Then student IDs from that university will not bear 
the digital signature of the university itself, but rather the signature of the registrar. To 
prove that Bob is a student at UIUC, then, he will have to present both his student ID 
and the (signed) policy from UIUC that delegates authority to the registrar to issue IDs. 
This level of detail will not be present in E-Learn’s policy for giving student discounts, 
which will simply say that Bob has to be a student at UIUC. If E-Learn’s policy says 
that Bob must be a student at an institution accredited by ABET, Bob faces additional 
challenges during negotiation: how can he find the credentials that show that his uni- 
versity is accredited, or conclude that no such credentials exist? Previous work on trust 
negotiation has addressed the questions of how to specify and reason about delegations 
of authority [11] and how to find credentials [12]. 
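
The registrar example can be sketched in Python as a check over signed statements. This is a simplified illustration, not the mechanism of [11] or [12]: the `signer` field stands in for a verified digital signature, the `Statement` record and the `proves` function are hypothetical, and the sketch assumes that delegated authority may be passed on transitively.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Statement:
    signer: str               # stands in for a verified digital signature
    claim: Tuple[str, ...]    # ("student", "Bob") or ("delegates", "student", "UIUC Registrar")


def proves(statements: List[Statement], authority: str,
           predicate: str, subject: str) -> bool:
    """True if `authority` vouches for predicate(subject), directly or by delegation."""
    trusted: Set[str] = {authority}
    changed = True
    while changed:  # collect every signer that `authority` has delegated to
        changed = False
        for s in statements:
            if (s.signer in trusted and s.claim[:2] == ("delegates", predicate)
                    and s.claim[2] not in trusted):
                trusted.add(s.claim[2])
                changed = True
    return any(s.signer in trusted and s.claim == (predicate, subject)
               for s in statements)


student_id = Statement("UIUC Registrar", ("student", "Bob"))
delegation = Statement("UIUC", ("delegates", "student", "UIUC Registrar"))

print(proves([student_id, delegation], "UIUC", "student", "Bob"))
print(proves([student_id], "UIUC", "student", "Bob"))
```

With only the student ID, the check fails, which mirrors the point that Bob must present both the ID and the signed delegation from UIUC.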

3 Distributed Logic Programs 

3.1 Syntax 

Definite Horn clauses. PeerTrust’s language is based on first order Horn rules (definite 
Horn clauses), i.e., rules of the form 

$lit_0 \leftarrow lit_1, \ldots, lit_n$ 

where each $lit_i$ is a positive literal $p_j(t_1, \ldots, t_n)$, $p_j$ is a predicate symbol, and the $t_i$ 
are the arguments of this predicate. Each $t_i$ is a term, i.e., a function symbol and its 
arguments, which are themselves terms. The head of a rule is $lit_0$, and its body is the 
set of $lit_i$. The body of a rule can be empty. 
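
As an illustration of how such rules are evaluated, the following Python sketch computes all consequences of a set of definite Horn clauses by naive forward chaining. It is a minimal sketch under simplifying assumptions: terms are restricted to constants and variables (no function symbols), variables are recognized by an uppercase first letter, and the predicate names in the example are hypothetical. It is not the evaluation procedure used by PeerTrust or Edutella.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Literal = Tuple[str, ...]              # (predicate, arg1, arg2, ...)
Rule = Tuple[Literal, List[Literal]]   # (head, body); an empty body is a fact


def is_var(term: str) -> bool:
    # Assumption: variables start with an uppercase letter, constants do not.
    return term[:1].isupper()


def match(pattern: Literal, fact: Literal,
          env: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `env` so that `pattern` equals the ground `fact`, or return None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    env = dict(env)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if env.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return env


def solve(body: List[Literal], facts: Set[Literal],
          env: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield every variable binding that satisfies all body literals."""
    if not body:
        yield env
        return
    for fact in facts:
        new_env = match(body[0], fact, env)
        if new_env is not None:
            yield from solve(body[1:], facts, new_env)


def forward_chain(rules: List[Rule]) -> Set[Literal]:
    """Apply the rules repeatedly until no new fact can be derived."""
    facts: Set[Literal] = set()
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for env in list(solve(body, facts, {})):
                derived = (head[0],) + tuple(env.get(t, t) for t in head[1:])
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


rules: List[Rule] = [
    (("student", "alice", "uiuc"), []),
    (("accredited", "uiuc"), []),
    (("discount", "X"), [("student", "X", "U"), ("accredited", "U")]),
]
print(("discount", "alice") in forward_chain(rules))
```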

Definite Horn clauses are the basis for logic programs [13], which have been used 
as the basis for the rule layer of the Semantic Web and specified in the RuleML effort 
([4,5]) as well as in the recent OWL Rules Draft [7]. Definite Horn clauses can be 
easily extended to include negation as failure, restricted versions of classical negation, 
and additional constraint handling capabilities such as those used in constraint logic 
programming. Although all of these features can be useful in trust negotiation, we will 
instead focus on other more unusual required language extensions. 

Definite Horn clauses are used in the Edutella infrastructure to represent each peer’s 
knowledge about its local resources, including services, data, credentials, and the poli- 
cies for its resources. Edutella also uses a restricted form of definite Horn clauses as the 
language peers use to query one another, as well as the language used to represent query 
answers. This language is a strict superset of relational algebra. On top of this definite 
Horn clause language, we need to add some additional features, discussed in the next 
sections. 

References to other peers. The ability to reason about statements made by other peers 
is central to trust negotiation. For example, in section 2, E-Learn wants to see a state- 
ment from Alice’s employer that says that she is a police officer. One can think of this 
as a case of E-Learn delegating evaluation of the query “Is Alice a police officer?” to 
the California State Police (CSP). Once CSP receives the query, the manner in which 
CSP handles it may depend on who asked the query. Thus CSP needs a way to specify 
which peer made each request that it receives. To express delegation of evaluation to 
another peer, we extend each literal $lit_i$ with an additional Authority argument, 

$lit_i$ @ Authority 

where Authority specifies the peer who is responsible for evaluating $lit_i$ or has the 
authority to evaluate $lit_i$. For example, E-Learn’s discount policy might mention 
policeOfficer(“Alice”) @ “CSP”. If that literal evaluates to true, then CSP says that Alice 
is a California police officer. As another example, a company eOrg may have a policy 
that students at UIUC are preferred customers. 

eOrg: 

preferred(X) <- student(X) @ “UIUC”. 
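
The effect of the Authority argument can be sketched in Python as one peer forwarding the evaluation of a literal to another. This is a strongly simplified illustration, not PeerTrust's implementation: the `Peer` class, its method names, and the in-memory `network` dictionary are hypothetical, all predicates are unary and share the single argument of the rule head, and no cycle detection is performed. Each peer records who asked, reflecting the point that the handling of a query may depend on the requester.

```python
from typing import Dict, List, Optional, Set, Tuple

# (predicate, authority); an authority of None means "evaluate locally".
BodyLiteral = Tuple[str, Optional[str]]


class Peer:
    def __init__(self, name: str, network: Dict[str, "Peer"]) -> None:
        self.name = name
        self.network = network
        self.facts: Set[Tuple[str, str]] = set()             # ground facts, e.g. ("student", "alice")
        self.rules: Dict[str, List[List[BodyLiteral]]] = {}  # head predicate -> alternative bodies
        self.log: List[Tuple[str, str, str]] = []            # (requester, predicate, subject)
        network[name] = self

    def holds(self, predicate: str, subject: str, requester: str) -> bool:
        """Answer the query predicate(subject) on behalf of `requester`."""
        self.log.append((requester, predicate, subject))
        if (predicate, subject) in self.facts:
            return True
        return any(
            all(self._evaluate(p, authority, subject) for p, authority in body)
            for body in self.rules.get(predicate, [])
        )

    def _evaluate(self, predicate: str, authority: Optional[str], subject: str) -> bool:
        # A literal "p(X) @ authority" is delegated to the named peer.
        target = self if authority is None else self.network[authority]
        return target.holds(predicate, subject, requester=self.name)


network: Dict[str, Peer] = {}
eorg = Peer("eOrg", network)
uiuc = Peer("UIUC", network)
uiuc.facts.add(("student", "alice"))
# preferred(X) <- student(X) @ "UIUC".
eorg.rules["preferred"] = [[("student", "UIUC")]]

print(eorg.holds("preferred", "alice", requester="alice"))
print(eorg.holds("preferred", "bob", requester="alice"))
print(uiuc.log[0])
```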




